Computer Vision and Pattern Recognition 130
☆ Are Video Models Ready as Zero-Shot Reasoners? An Empirical Study with the MME-CoF Benchmark
Ziyu Guo, Xinyan Chen, Renrui Zhang, Ruichuan An, Yu Qi, Dongzhi Jiang, Xiangtai Li, Manyuan Zhang, Hongsheng Li, Pheng-Ann Heng
Recent video generation models can produce high-fidelity, temporally coherent
videos, indicating that they may encode substantial world knowledge. Beyond
realistic synthesis, they also exhibit emerging behaviors indicative of visual
perception, modeling, and manipulation. Yet, an important question still
remains: Are video models ready to serve as zero-shot reasoners in challenging
visual reasoning scenarios? In this work, we conduct an empirical study to
comprehensively investigate this question, focusing on the leading and popular
Veo-3. We evaluate its reasoning behavior across 12 dimensions, including
spatial, geometric, physical, temporal, and embodied logic, systematically
characterizing both its strengths and failure modes. To standardize this study,
we curate the evaluation data into MME-CoF, a compact benchmark that enables
in-depth and thorough assessment of Chain-of-Frame (CoF) reasoning. Our
findings reveal that while current video models demonstrate promising reasoning
patterns on short-horizon spatial coherence, fine-grained grounding, and
locally consistent dynamics, they remain limited in long-horizon causal
reasoning, strict geometric constraints, and abstract logic. Overall, they are
not yet reliable as standalone zero-shot reasoners, but exhibit encouraging
signs as complementary visual engines alongside dedicated reasoning models.
Project page: https://video-cof.github.io
comment: Project Page: https://video-cof.github.io
☆ OmniX: From Unified Panoramic Generation and Perception to Graphics-Ready 3D Scenes
There are two prevalent ways of constructing 3D scenes: procedural generation
and 2D lifting. Among them, panorama-based 2D lifting has emerged as a
promising technique, leveraging powerful 2D generative priors to produce
immersive, realistic, and diverse 3D environments. In this work, we advance
this technique to generate graphics-ready 3D scenes suitable for physically
based rendering (PBR), relighting, and simulation. Our key insight is to
repurpose 2D generative models for panoramic perception of geometry, textures,
and PBR materials. Unlike existing 2D lifting approaches that emphasize
appearance generation and ignore the perception of intrinsic properties, we
present OmniX, a versatile and unified framework. Based on a lightweight and
efficient cross-modal adapter structure, OmniX reuses 2D generative priors for
a broad range of panoramic vision tasks, including panoramic perception,
generation, and completion. Furthermore, we construct a large-scale synthetic
panorama dataset containing high-quality multimodal panoramas from diverse
indoor and outdoor scenes. Extensive experiments demonstrate the effectiveness
of our model in panoramic visual perception and graphics-ready 3D scene
generation, opening new possibilities for immersive and physically realistic
virtual world generation.
comment: Project page: https://yukun-huang.github.io/OmniX/
☆ Masked Diffusion Captioning for Visual Feature Learning EMNLP 2025
We learn visual features by captioning images with an image-conditioned
masked diffusion language model, a formulation we call masked diffusion
captioning (MDC). During training, text tokens in each image-caption pair are
masked at a randomly chosen ratio, and a decoder conditioned on visual features
is trained to reconstruct the original text. After training, the learned visual
features can be applied to downstream vision tasks. Unlike autoregressive
captioning, the strength of the visual learning signal in MDC does not depend
on each token's position in the sequence, reducing the need for auxiliary
objectives. Linear probing experiments across a variety of academic-scale
models and datasets show that the learned visual features are competitive with
those produced by autoregressive and contrastive approaches.
comment: EMNLP 2025 (Findings). Project page:
https://cfeng16.github.io/mdlm4vfl/
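The masking step described above can be sketched as follows; `mask_caption` and the `[MASK]` token are illustrative names, not the paper's actual implementation:

```python
import random

def mask_caption(tokens, mask_token="[MASK]", seed=None):
    """Mask text tokens at a randomly chosen ratio; a decoder conditioned
    on visual features would then be trained to reconstruct the originals."""
    rng = random.Random(seed)
    ratio = rng.uniform(0.0, 1.0)  # masking ratio sampled per image-caption pair
    masked, targets = [], []
    for i, tok in enumerate(tokens):
        if rng.random() < ratio:
            masked.append(mask_token)
            targets.append((i, tok))  # positions the decoder must recover
        else:
            masked.append(tok)
    return masked, targets, ratio
```

Because the ratio is resampled for each pair, the learning signal is spread across all token positions rather than concentrated by sequence order, which is the property the abstract contrasts with autoregressive captioning.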
☆ SEE4D: Pose-Free 4D Generation via Auto-Regressive Video Inpainting
Dongyue Lu, Ao Liang, Tianxin Huang, Xiao Fu, Yuyang Zhao, Baorui Ma, Liang Pan, Wei Yin, Lingdong Kong, Wei Tsang Ooi, Ziwei Liu
Immersive applications call for synthesizing spatiotemporal 4D content from
casual videos without costly 3D supervision. Existing video-to-4D methods
typically rely on manually annotated camera poses, which are labor-intensive
and brittle for in-the-wild footage. Recent warp-then-inpaint approaches
mitigate the need for pose labels by warping input frames along a novel camera
trajectory and using an inpainting model to fill missing regions, thereby
depicting the 4D scene from diverse viewpoints. However, this
trajectory-to-trajectory formulation often entangles camera motion with scene
dynamics and complicates both modeling and inference. We introduce SEE4D, a
pose-free, trajectory-to-camera framework that replaces explicit trajectory
prediction with rendering to a bank of fixed virtual cameras, thereby
separating camera control from scene modeling. A view-conditional video
inpainting model is trained to learn a robust geometry prior by denoising
realistically synthesized warped images and to inpaint occluded or missing
regions across virtual viewpoints, eliminating the need for explicit 3D
annotations. Building on this inpainting core, we design a spatiotemporal
autoregressive inference pipeline that traverses virtual-camera splines and
extends videos with overlapping windows, enabling coherent generation at
bounded per-step complexity. We validate SEE4D on cross-view video generation
and sparse reconstruction benchmarks. Across quantitative metrics and
qualitative assessments, our method achieves superior generalization and
improved performance relative to pose- or trajectory-conditioned baselines,
advancing practical 4D world modeling from casual videos.
comment: 26 pages; 21 figures; 3 tables; project page:
https://see-4d.github.io/
☆ Scaling Image Geo-Localization to Continent Level NeurIPS 2025
Philipp Lindenberger, Paul-Edouard Sarlin, Jan Hosang, Matteo Balice, Marc Pollefeys, Simon Lynen, Eduard Trulls
Determining the precise geographic location of an image at a global scale
remains an unsolved challenge. Standard image retrieval techniques are
inefficient due to the sheer volume of images (>100M) and fail when coverage is
insufficient. Scalable solutions, however, involve a trade-off: global
classification typically yields coarse results (10+ kilometers), while
cross-view retrieval between ground and aerial imagery suffers from a domain
gap and has been primarily studied on smaller regions. This paper introduces a
hybrid approach that achieves fine-grained geo-localization across a large
geographic expanse the size of a continent. We leverage a proxy classification
task during training to learn rich feature representations that implicitly
encode precise location information. We combine these learned prototypes with
embeddings of aerial imagery to increase robustness to the sparsity of
ground-level data. This enables direct, fine-grained retrieval over areas
spanning multiple countries. Our extensive evaluation demonstrates that our
approach localizes more than 68% of queries within 200 m on a dataset
covering a large part of Europe. The code is publicly available at
https://scaling-geoloc.github.io.
comment: NeurIPS 2025
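The retrieval step described above can be reduced to a nearest-prototype lookup; the names below are hypothetical, and the real system fuses learned prototypes with aerial-imagery embeddings in a more involved way:

```python
import numpy as np

def localize(query_emb, prototypes, locations):
    """Return the (lat, lon) of the prototype most similar to the query
    image embedding, using cosine similarity.

    query_emb:  (D,) embedding of the ground-level query image
    prototypes: (N, D) learned per-cell prototypes (optionally fused
                with aerial-imagery embeddings)
    locations:  list of N (lat, lon) cell centers
    """
    q = query_emb / np.linalg.norm(query_emb)
    P = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    return locations[int(np.argmax(P @ q))]
```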
☆ The Quest for Generalizable Motion Generation: Data, Model, and Evaluation
Jing Lin, Ruisi Wang, Junzhe Lu, Ziqi Huang, Guorui Song, Ailing Zeng, Xian Liu, Chen Wei, Wanqi Yin, Qingping Sun, Zhongang Cai, Lei Yang, Ziwei Liu
Despite recent advances in 3D human motion generation (MoGen) on standard
benchmarks, existing models still face a fundamental bottleneck in their
generalization capability. In contrast, adjacent generative fields, most
notably video generation (ViGen), have demonstrated remarkable generalization
in modeling human behaviors, highlighting transferable insights that MoGen can
leverage. Motivated by this observation, we present a comprehensive framework
that systematically transfers knowledge from ViGen to MoGen across three key
pillars: data, modeling, and evaluation. First, we introduce ViMoGen-228K, a
large-scale dataset comprising 228,000 high-quality motion samples that
integrates high-fidelity optical MoCap data with semantically annotated motions
from web videos and synthesized samples generated by state-of-the-art ViGen
models. The dataset includes both text-motion pairs and text-video-motion
triplets, substantially expanding semantic diversity. Second, we propose
ViMoGen, a flow-matching-based diffusion transformer that unifies priors from
MoCap data and ViGen models through gated multimodal conditioning. To enhance
efficiency, we further develop ViMoGen-light, a distilled variant that
eliminates video generation dependencies while preserving strong
generalization. Finally, we present MBench, a hierarchical benchmark designed
for fine-grained evaluation across motion quality, prompt fidelity, and
generalization ability. Extensive experiments show that our framework
significantly outperforms existing approaches in both automatic and human
evaluations. The code, data, and benchmark will be made publicly available.
☆ HEIR: Learning Graph-Based Motion Hierarchies
Hierarchical structures of motion exist across research fields, including
computer vision, graphics, and robotics, where complex dynamics typically arise
from coordinated interactions among simpler motion components. Existing methods
to model such dynamics typically rely on manually-defined or heuristic
hierarchies with fixed motion primitives, limiting their generalizability
across different tasks. In this work, we propose a general hierarchical motion
modeling method that learns structured, interpretable motion relationships
directly from data. Our method represents observed motions using graph-based
hierarchies, explicitly decomposing global absolute motions into
parent-inherited patterns and local motion residuals. We formulate hierarchy
inference as a differentiable graph learning problem, where vertices represent
elemental motions and directed edges capture learned parent-child dependencies
through graph neural networks. We evaluate our hierarchical reconstruction
approach on three examples: 1D translational motion, 2D rotational motion, and
dynamic 3D scene deformation via Gaussian splatting. Experimental results show
that our method reconstructs the intrinsic motion hierarchy in 1D and 2D cases,
and produces more realistic and interpretable deformations compared to the
baseline on dynamic 3D Gaussian splatting scenes. By providing an adaptable,
data-driven hierarchical modeling paradigm, our method offers a formulation
applicable to a broad range of motion-centric tasks. Project Page:
https://light.princeton.edu/HEIR/
comment: Code link: https://github.com/princeton-computational-imaging/HEIR
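The decomposition at the heart of the method, global motion as parent-inherited pattern plus local residual, can be written out directly; these helper names are ours, not the paper's:

```python
import numpy as np

def decompose(child_global, parent_global):
    """Split a vertex's global motion into the part inherited from its
    parent and a local residual: child_global = parent_global + residual."""
    return child_global - parent_global

def compose(parent_global, residual):
    """Recover the global motion from the parent's motion and the residual."""
    return parent_global + residual
```

Hierarchy inference then amounts to learning, for each vertex, which parent assignment yields the simplest residuals, which the paper does with a differentiable graph learning formulation.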
☆ Clone Deterministic 3D Worlds with Geometrically-Regularized World Models
A world model is an internal model that simulates how the world evolves.
Given past observations and actions, it predicts the future of both the
embodied agent and its environment. Accurate world models are essential for
enabling agents to think, plan, and reason effectively in complex, dynamic
settings. Despite rapid progress, current world models remain brittle and
degrade over long horizons. We argue that a central cause is representation
quality: exteroceptive inputs (e.g., images) are high-dimensional, and lossy or
entangled latents make dynamics learning unnecessarily hard. We therefore ask
whether improving representation learning alone can substantially improve
world-model performance. In this work, we take a step toward building a truly
accurate world model by addressing a fundamental yet open problem: constructing
a model that can fully clone and overfit to a deterministic 3D world. We
propose Geometrically-Regularized World Models (GRWM), which enforces that
consecutive points along a natural sensory trajectory remain close in latent
representation space. This approach yields significantly improved latent
representations that align closely with the true topology of the environment.
GRWM is plug-and-play, requires only minimal architectural modification, scales
with trajectory length, and is compatible with diverse latent generative
backbones. Across deterministic 3D settings and long-horizon prediction tasks,
GRWM significantly increases rollout fidelity and stability. Analyses show that
its benefits stem from learning a latent manifold with superior geometric
structure. These findings support a clear takeaway: improving representation
learning is a direct and useful path to robust world models, delivering
reliable long-horizon predictions without enlarging the dynamics module.
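The regularizer described above, keeping consecutive points on a sensory trajectory close in latent space, reduces to a simple penalty; this is a minimal sketch, not GRWM's exact loss:

```python
import numpy as np

def trajectory_consistency_loss(latents):
    """Mean squared latent distance between consecutive observations
    along a natural sensory trajectory. `latents` has shape (T, D)."""
    diffs = latents[1:] - latents[:-1]
    return float((diffs ** 2).sum(axis=1).mean())
```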
☆ ChartAB: A Benchmark for Chart Grounding & Dense Alignment
Charts play an important role in visualization, reasoning, data analysis, and
the exchange of ideas among humans. However, existing vision-language models
(VLMs) still lack accurate perception of details and struggle to extract
fine-grained structures from charts. Such limitations in chart grounding also
hinder their ability to compare multiple charts and reason over them. In this
paper, we introduce a novel "ChartAlign Benchmark (ChartAB)" to provide a
comprehensive evaluation of VLMs in chart grounding tasks, i.e., extracting
tabular data, localizing visualization elements, and recognizing various
attributes from charts of diverse types and complexities. We design a JSON
template to facilitate the calculation of evaluation metrics specifically
tailored for each grounding task. By incorporating a novel two-stage inference
workflow, the benchmark can further evaluate VLMs' capability to align and
compare elements/attributes across two charts. Our analysis of evaluations on
several recent VLMs reveals new insights into their perception biases,
weaknesses, robustness, and hallucinations in chart understanding. These
findings highlight the fine-grained discrepancies among VLMs in chart
understanding tasks and point to specific skills that need to be strengthened
in current models.
☆ Surpassing state of the art on AMD area estimation from RGB fundus images through careful selection of U-Net architectures and loss functions for class imbalance
Age-related macular degeneration (AMD) is one of the leading causes of
irreversible vision impairment in people over the age of 60. This research
focuses on semantic segmentation for AMD lesion detection in RGB fundus images,
a non-invasive and cost-effective imaging technique. The results of the ADAM
challenge - the most comprehensive AMD detection from RGB fundus images
research competition and open dataset to date - serve as a benchmark for our
evaluation. Taking U-Net connectivity as the base of our framework, we
evaluate and compare several approaches to improve the segmentation model's
architecture and training pipeline, including pre-processing techniques,
encoder (backbone) deep network types of varying complexity, and specialized
loss functions to mitigate class imbalances on image and pixel levels. The main
outcome of this research is the final configuration of the AMD detection
framework, which outperforms all the prior ADAM challenge submissions on the
multi-class segmentation of different AMD lesion types in non-invasive RGB
fundus images. The source code used to conduct the experiments presented in
this paper is made freely available.
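One family of specialized losses for such pixel-level class imbalance is the soft Dice loss; the sketch below is a generic formulation, not necessarily the configuration the paper converged on:

```python
import numpy as np

def soft_dice_loss(probs, targets, eps=1e-6):
    """Soft Dice loss over a multi-class segmentation map.

    probs:   (C, H, W) predicted class probabilities
    targets: (C, H, W) one-hot ground truth
    Each class contributes equally to the mean regardless of its pixel
    count, which counteracts imbalance between lesion types and background.
    """
    inter = (probs * targets).sum(axis=(1, 2))
    denom = probs.sum(axis=(1, 2)) + targets.sum(axis=(1, 2))
    return float(1.0 - ((2.0 * inter + eps) / (denom + eps)).mean())
```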
☆ SteerVLM: Robust Model Control through Lightweight Activation Steering for Vision Language Models
This work introduces SteerVLM, a lightweight steering module designed to
guide Vision-Language Models (VLMs) towards outputs that better adhere to
desired instructions. Our approach learns from the latent embeddings of paired
prompts encoding target and converse behaviors to dynamically adjust
activations connecting the language modality with image context. This allows
for fine-grained, inference-time control over complex output semantics without
modifying model weights while preserving performance on off-target tasks. Our
steering module requires learnable parameters amounting to only 0.14% of the
original VLM's size, and it gains model control through dimension-wise
activation modulation and adaptive steering across layers, without requiring
pre-extracted static vectors or manual tuning of intervention points.
Furthermore, we introduce VNIA (Visual Narrative Intent Alignment), a
multimodal dataset specifically created to facilitate the development and
evaluation of VLM steering techniques. Our method outperforms existing
intervention techniques on steering and hallucination mitigation benchmarks for
VLMs and proposes a robust solution for multimodal model control through
activation engineering.
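In its simplest form, activation steering of this kind shifts hidden states along a direction derived from paired target/converse prompts; the sketch below shows the generic idea with a per-dimension modulation, and is a stand-in for, not a reproduction of, SteerVLM's learned module:

```python
import numpy as np

def steer(hidden, target_act, converse_act, alpha=1.0):
    """Shift a hidden-state vector along the target-minus-converse
    direction, scaling each dimension by the direction's relative
    magnitude there (a crude proxy for learned dimension-wise modulation)."""
    direction = target_act - converse_act
    scale = np.abs(direction) / (np.abs(direction).max() + 1e-8)
    return hidden + alpha * scale * direction
```

Setting `alpha=0` recovers the unmodified model, so the intervention strength can be dialed in at inference time without touching the weights.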
☆ MORE: Multi-Organ Medical Image REconstruction Dataset
Shaokai Wu, Yapan Guo, Yanbiao Ji, Jing Tong, Yuxiang Lu, Mei Li, Suizhi Huang, Yue Ding, Hongtao Lu
CT reconstruction provides radiologists with images for diagnosis and
treatment, yet current deep learning methods are typically limited to specific
anatomies and datasets, hindering generalization ability to unseen anatomies
and lesions. To address this, we introduce the Multi-Organ medical image
REconstruction (MORE) dataset, comprising CT scans across 9 diverse anatomies
with 15 lesion types. This dataset serves two key purposes: (1) enabling robust
training of deep learning models on extensive, heterogeneous data, and (2)
facilitating rigorous evaluation of model generalization for CT reconstruction.
We further establish a strong baseline solution that outperforms prior
approaches under these challenging conditions. Our results demonstrate that:
(1) a comprehensive dataset helps improve the generalization capability of
models, and (2) optimization-based methods offer enhanced robustness for unseen
anatomies. The MORE dataset is freely accessible under CC-BY-NC 4.0 at our
project page https://more-med.github.io/
comment: Accepted to ACMMM 2025
☆ ProstNFound+: A Prospective Study using Medical Foundation Models for Prostate Cancer Detection
Paul F. R. Wilson, Mohamed Harmanani, Minh Nguyen Nhat To, Amoon Jamzad, Tarek Elghareb, Zhuoxin Guo, Adam Kinnaird, Brian Wodlinger, Purang Abolmaesumi, Parvin Mousavi
Purpose: Medical foundation models (FMs) offer a path to build
high-performance diagnostic systems. However, their application to prostate
cancer (PCa) detection from micro-ultrasound (µUS) remains untested in
clinical settings. We present ProstNFound+, an adaptation of FMs for PCa
detection from µUS, along with its first prospective validation. Methods:
ProstNFound+ incorporates a medical FM, adapter tuning, and a custom prompt
encoder that embeds PCa-specific clinical biomarkers. The model generates a
cancer heatmap and a risk score for clinically significant PCa. Following
training on multi-center retrospective data, the model is prospectively
evaluated on data acquired five years later from a new clinical site. Model
predictions are benchmarked against standard clinical scoring protocols
(PRI-MUS and PI-RADS). Results: ProstNFound+ shows strong generalization to the
prospective data, with no performance degradation compared to retrospective
evaluation. It aligns closely with clinical scores and produces interpretable
heatmaps consistent with biopsy-confirmed lesions. Conclusion: The results
highlight its potential for clinical deployment, offering a scalable and
interpretable alternative to expert-driven protocols.
☆ The Impact and Outlook of 3D Gaussian Splatting
Since its introduction, 3D Gaussian Splatting (3DGS) has rapidly transformed
the landscape of 3D scene representations, inspiring an extensive body of
associated research. Follow-up work includes analyses and contributions that
enhance the efficiency, scalability, and real-world applicability of 3DGS. In
this summary, we present an overview of several key directions that have
emerged in the wake of 3DGS. We highlight advances enabling resource-efficient
training and rendering, the evolution toward dynamic (or four-dimensional,
4DGS) representations, and deeper exploration of the mathematical foundations
underlying its appearance modeling and rendering process. Furthermore, we
examine efforts to bring 3DGS to mobile and virtual reality platforms, its
extension to massive-scale environments, and recent progress toward
near-instant radiance field reconstruction via feed-forward or distributed
computation. Collectively, these developments illustrate how 3DGS has evolved
from a breakthrough representation into a versatile and foundational tool for
3D vision and graphics.
comment: Article written for Frontiers of Science Award, International
Congress on Basic Science, 2025
☆ Process Integrated Computer Vision for Real-Time Failure Prediction in Steel Rolling Mill
We present a long-term deployment study of a machine vision-based anomaly
detection system for failure prediction in a steel rolling mill. The system
integrates industrial cameras to monitor equipment operation, alignment, and
hot bar motion in real time along the process line. Live video streams are
processed on a centralized video server using deep learning models, enabling
early prediction of equipment failures and process interruptions, thereby
reducing unplanned breakdown costs. Server-based inference minimizes the
computational load on industrial process control systems (PLCs), supporting
scalable deployment across production lines with minimal additional resources.
By jointly analyzing sensor data from data acquisition systems and visual
inputs, the system identifies the location and probable root causes of
failures, providing actionable insights for proactive maintenance. This
integrated approach enhances operational reliability, productivity, and
profitability in industrial manufacturing environments.
☆ Improving Classification of Occluded Objects through Scene Context
The presence of occlusions poses substantial challenges to typically
powerful object recognition algorithms. Additional sources of
information can be extremely valuable to reduce errors caused by occlusions.
Scene context is known to aid in object recognition in biological vision. In
this work, we attempt to add robustness into existing Region Proposal
Network-Deep Convolutional Neural Network (RPN-DCNN) object detection networks
through two distinct scene-based information fusion techniques. We present one
algorithm under each methodology: the first operates prior to prediction,
selecting a custom object network to use based on the identified background
scene, and the second operates after detection, fusing scene knowledge into
initial object scores output by the RPN. We demonstrate our algorithms on
challenging datasets featuring partial occlusions, which show overall
improvement in both recall and precision against baseline methods. In addition,
our experiments contrast multiple training methodologies for occlusion
handling, finding that training on a combination of both occluded and
unoccluded images demonstrates an improvement over the others. Our method is
interpretable and can easily be adapted to other datasets, offering many future
directions for research and practical applications.
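The post-detection variant, fusing scene knowledge into the initial object scores, can be illustrated with a simple weighted combination; the weighting rule and names here are hypothetical, not the paper's exact formulation:

```python
def fuse_scene_scores(obj_scores, scene_prior, weight=0.3):
    """Blend per-class detection scores with a scene-conditioned class
    prior (e.g., 'car' is likelier against a 'street' background).

    obj_scores:  {class: initial score output by the RPN}
    scene_prior: {class: P(class | identified background scene)}
    """
    return {c: (1.0 - weight) * s + weight * scene_prior.get(c, 0.0)
            for c, s in obj_scores.items()}
```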
☆ BRIQA: Balanced Reweighting in Image Quality Assessment of Pediatric Brain MRI
Assessing the severity of artifacts in pediatric brain Magnetic Resonance
Imaging (MRI) is critical for diagnostic accuracy, especially in low-field
systems where the signal-to-noise ratio is reduced. Manual quality assessment
is time-consuming and subjective, motivating the need for robust automated
solutions. In this work, we propose BRIQA (Balanced Reweighting in Image
Quality Assessment), which addresses class imbalance in artifact severity
levels. BRIQA uses gradient-based loss reweighting to dynamically adjust
per-class contributions and employs a rotating batching scheme to ensure
consistent exposure to underrepresented classes. Our experiments show that no
single architecture performs best across all artifact types, emphasizing the
importance of architectural diversity. The rotating batching configuration
improves performance across metrics by promoting balanced learning when
combined with cross-entropy loss. BRIQA improves average macro F1 score from
0.659 to 0.706, with notable gains in Noise (0.430), Zipper (0.098),
Positioning (0.097), Contrast (0.217), Motion (0.022), and Banding (0.012)
artifact severity classification. The code is available at
https://github.com/BioMedIA-MBZUAI/BRIQA.
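The rotating batching scheme can be sketched as cycling through per-class sample queues so that every batch draws from all severity classes; this is an illustrative reconstruction, not BRIQA's released code:

```python
from collections import defaultdict
from itertools import cycle

def rotating_batches(labels, batch_size):
    """Yield index batches that draw from the classes in rotation,
    guaranteeing consistent exposure to underrepresented classes
    (samples from rare classes are recycled as needed)."""
    queues = defaultdict(list)
    for idx, y in enumerate(labels):
        queues[y].append(idx)
    iters = {y: cycle(q) for y, q in queues.items()}
    class_cycle = cycle(sorted(queues))
    for _ in range(len(labels) // batch_size):
        yield [next(iters[next(class_cycle)]) for _ in range(batch_size)]
```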
☆ Towards Reliable Sea Ice Drift Estimation in the Arctic: Deep Learning Optical Flow on RADARSAT-2
Accurate estimation of sea ice drift is critical for Arctic navigation,
climate research, and operational forecasting. While optical flow, a computer
vision technique for estimating pixel-wise motion between consecutive images,
has advanced rapidly, its applicability to geophysical
problems and to satellite SAR imagery remains underexplored. Classical optical
flow methods rely on mathematical models and strong assumptions about motion,
which limit their accuracy in complex scenarios. Recent deep learning based
approaches have substantially improved performance and are now the standard in
computer vision, motivating their application to sea ice drift estimation. We
present the first large-scale benchmark of 48 deep learning optical flow models
on RADARSAT-2 ScanSAR sea ice imagery, evaluated with endpoint error (EPE) and
Fl-all metrics against GNSS-tracked buoys. Several models achieve sub-kilometer
accuracy (EPE of 6 to 8 pixels, 300 to 400 m), a small error relative to the
spatial scales of sea ice motion and typical navigation requirements in the
Arctic. Our results demonstrate that the models are capable of capturing
consistent regional drift patterns and that recent deep learning based optical
flow methods, which have substantially improved motion estimation accuracy
compared to classical methods, can be effectively transferred to polar remote
sensing. Optical flow produces spatially continuous drift fields, providing
motion estimates for every image pixel rather than at sparse buoy locations,
offering new opportunities for navigation and climate modeling.
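The primary metric, endpoint error, is simply the mean Euclidean distance between predicted and reference displacement vectors; a minimal implementation:

```python
import numpy as np

def endpoint_error(flow_pred, flow_gt):
    """Mean endpoint error (EPE): average Euclidean distance between
    predicted and reference flow vectors. Inputs are (H, W, 2) arrays
    of per-pixel displacements measured in pixels."""
    return float(np.linalg.norm(flow_pred - flow_gt, axis=-1).mean())
```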
☆ All You Need for Object Detection: From Pixels, Points, and Prompts to Next-Gen Fusion and Multimodal LLMs/VLMs in Autonomous Vehicles
Sayed Pedram Haeri Boroujeni, Niloufar Mehrabi, Hazim Alzorgan, Ahmad Sarlak, Mahlagha Fazeli, Abolfazl Razi
Autonomous Vehicles (AVs) are transforming the future of transportation
through advances in intelligent perception, decision-making, and control
systems. However, their success is tied to one core capability, reliable object
detection in complex and multimodal environments. While recent breakthroughs in
Computer Vision (CV) and Artificial Intelligence (AI) have driven remarkable
progress, the field still faces a critical challenge as knowledge remains
fragmented across multimodal perception, contextual reasoning, and cooperative
intelligence. This survey bridges that gap by delivering a forward-looking
analysis of object detection in AVs, emphasizing emerging paradigms such as
Vision-Language Models (VLMs), Large Language Models (LLMs), and Generative AI
rather than re-examining outdated techniques. We begin by systematically
reviewing the fundamental spectrum of AV sensors (camera, ultrasonic, LiDAR,
and Radar) and their fusion strategies, highlighting not only their
capabilities and limitations in dynamic driving environments but also their
potential to integrate with recent advances in LLM/VLM-driven perception
frameworks. Next, we introduce a structured categorization of AV datasets that
moves beyond simple collections, positioning ego-vehicle, infrastructure-based,
and cooperative datasets (e.g., V2V, V2I, V2X, I2I), followed by a
cross-analysis of data structures and characteristics. Ultimately, we analyze
cutting-edge detection methodologies, ranging from 2D and 3D pipelines to
hybrid sensor fusion, with particular attention to emerging transformer-driven
approaches powered by Vision Transformers (ViTs), Large and Small Language
Models (SLMs), and VLMs. By synthesizing these perspectives, our survey
delivers a clear roadmap of current capabilities, open challenges, and future
opportunities.
☆ SAMRI: Segment Anything Model for MRI
Zhao Wang, Wei Dai, Thuy Thanh Dao, Steffen Bollmann, Hongfu Sun, Craig Engstrom, Shekhar S. Chandra
Accurate magnetic resonance imaging (MRI) segmentation is crucial for
clinical decision-making, but remains labor-intensive when performed manually.
Convolutional neural network (CNN)-based methods can be accurate and efficient,
but often generalize poorly to MRI's variable contrast, intensity
inhomogeneity, and protocols. Although the transformer-based Segment Anything
Model (SAM) has demonstrated remarkable generalizability in natural images,
existing adaptations often treat MRI as another imaging modality, overlooking
these modality-specific challenges. We present SAMRI, an MRI-specialized SAM
trained and validated on 1.1 million labeled MR slices spanning whole-body
organs and pathologies. We demonstrate that SAM can be effectively adapted to
MRI by simply fine-tuning its mask decoder using a two-stage strategy, reducing
training time by 94% and trainable parameters by 96% versus full-model
retraining. Across diverse MRI segmentation tasks, SAMRI achieves a mean Dice
of 0.87, delivering state-of-the-art accuracy across anatomical regions and
robust generalization on unseen structures, particularly small and clinically
important structures.
☆ PT-DETR: Small Target Detection Based on Partially-Aware Detail Focus
To address the challenges in UAV object detection, such as complex
backgrounds, severe occlusion, dense small objects, and varying lighting
conditions, this paper proposes PT-DETR, a novel detection algorithm based on
RT-DETR and specifically designed for small objects in UAV imagery. In the
backbone network, we introduce the Partially-Aware Detail Focus (PADF) Module
to enhance feature extraction for small objects. Additionally, we design the
Median-Frequency Feature Fusion (MFFF) module, which effectively improves the
model's ability to capture small-object details and contextual information.
Furthermore, we incorporate Focaler-SIoU to strengthen the model's bounding box
matching capability and increase its sensitivity to small-object features,
thereby further enhancing detection accuracy and robustness. Compared with
RT-DETR, our PT-DETR achieves mAP improvements of 1.6% and 1.7% on the
VisDrone2019 dataset with lower computational complexity and fewer parameters,
demonstrating its robustness and feasibility for small-object detection tasks.
☆ Spiking Patches: Asynchronous, Sparse, and Efficient Tokens for Event Cameras
We propose tokenization of events and present a tokenizer, Spiking Patches,
specifically designed for event cameras. Given a stream of asynchronous and
spatially sparse events, our goal is to discover an event representation that
preserves these properties. Prior works have represented events as frames or as
voxels. However, while these representations yield high accuracy, both frames
and voxels are synchronous and decrease the spatial sparsity. Spiking Patches
gives the means to preserve the unique properties of event cameras and we show
in our experiments that this comes without sacrificing accuracy. We evaluate
our tokenizer using a GNN, PCN, and a Transformer on gesture recognition and
object detection. Tokens from Spiking Patches yield inference times that are up
to 3.4x faster than voxel-based tokens and up to 10.4x faster than frames. We
achieve this while matching their accuracy, and in some cases even surpassing
it, with absolute improvements of up to 3.8 for gesture recognition and up to
1.4 for object detection. Thus, tokenization constitutes a novel direction in
event-based vision and marks a step towards methods that preserve the
properties of event cameras.
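The tokenization idea described above, emitting a token only once a spatial patch has accumulated enough events so that quiet regions produce nothing and tokens appear at event time, can be sketched as follows. This is an illustrative toy, not the authors' implementation; the patch size and threshold values are assumptions.

```python
import numpy as np

def spiking_patch_tokens(events, patch=16, threshold=8):
    """Group asynchronous (t, x, y, polarity) events into spatial patches.
    A patch "spikes" (emits a token) only once it has buffered `threshold`
    events, preserving sparsity (quiet patches emit nothing) and asynchrony
    (tokens carry the timestamp of the triggering event)."""
    buffers = {}   # per-patch running event buffers
    tokens = []
    for t, x, y, p in events:
        key = (int(y) // patch, int(x) // patch)
        buf = buffers.setdefault(key, [])
        buf.append((t, x, y, p))
        if len(buf) >= threshold:                       # patch spikes
            tokens.append((t, key[0], key[1], np.asarray(buf)))
            buffers[key] = []                           # reset accumulator
    return tokens
```

Feeding 16 events confined to one 16x16 patch with a threshold of 8 yields exactly two tokens, both addressed to patch (0, 0), while the rest of the sensor contributes none.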
☆ CYPRESS: Crop Yield Prediction via Regression on Prithvi's Encoder for Satellite Sensing
Accurate and timely crop yield prediction is crucial for global food security
and modern agricultural management. Traditional methods often lack the
scalability and granularity required for precision farming. This paper
introduces CYPRESS (Crop Yield Prediction via Regression on Prithvi's Encoder
for Satellite Sensing), a deep learning model designed for high-resolution,
intra-field canola yield prediction. CYPRESS leverages a pre-trained,
large-scale geospatial foundation model (Prithvi-EO-2.0-600M) and adapts it for
a continuous regression task, transforming multi-temporal satellite imagery
into dense, pixel-level yield maps. Evaluated on a comprehensive dataset from
the Canadian Prairies, CYPRESS demonstrates superior performance over existing
deep learning-based yield prediction models, highlighting the effectiveness of
fine-tuning foundation models for specialized agricultural applications. By
providing a continuous, high-resolution output, CYPRESS offers a more
actionable tool for precision agriculture than conventional classification or
county-level aggregation methods. This work validates a novel approach that
bridges the gap between large-scale Earth observation and on-farm
decision-making, offering a scalable solution for detailed agricultural
monitoring.
☆ ResMatching: Noise-Resilient Computational Super-Resolution via Guided Conditional Flow Matching
Computational Super-Resolution (CSR) in fluorescence microscopy has, despite
being an ill-posed problem, a long history. At its very core, CSR is about
finding a prior that can be used to extrapolate frequencies in a micrograph
that have never been imaged by the image-generating microscope. It stands to
reason that, with the advent of better data-driven machine learning techniques,
stronger priors can be learned and hence CSR can lead to better results. Here,
we present ResMatching, a novel CSR method that uses guided conditional flow
matching to learn such improved data-priors. We evaluate ResMatching on 4
diverse biological structures from the BioSR dataset and compare its results
against 7 baselines. ResMatching consistently achieves competitive results,
demonstrating in all cases the best trade-off between data fidelity and
perceptual realism. We observe that CSR using ResMatching is particularly
effective in cases where a strong prior is hard to learn, e.g. when the given
low-resolution images contain a lot of noise. Additionally, we show that
ResMatching can be used to sample from an implicitly learned posterior
distribution and that this distribution is calibrated for all tested use-cases,
enabling our method to deliver a pixel-wise data-uncertainty term that can
guide future users to reject uncertain predictions.
comment: 5 pages, 4 figures
☆ Emu3.5: Native Multimodal Models are World Learners
Yufeng Cui, Honghao Chen, Haoge Deng, Xu Huang, Xinghang Li, Jirong Liu, Yang Liu, Zhuoyan Luo, Jinsheng Wang, Wenxuan Wang, Yueze Wang, Chengyuan Wang, Fan Zhang, Yingli Zhao, Ting Pan, Xianduo Li, Zecheng Hao, Wenxuan Ma, Zhuo Chen, Yulong Ao, Tiejun Huang, Zhongyuan Wang, Xinlong Wang
We introduce Emu3.5, a large-scale multimodal world model that natively
predicts the next state across vision and language. Emu3.5 is pre-trained
end-to-end with a unified next-token prediction objective on a corpus of
vision-language interleaved data containing over 10 trillion tokens, primarily
derived from sequential frames and transcripts of internet videos. The model
naturally accepts interleaved vision-language inputs and generates interleaved
vision-language outputs. Emu3.5 is further post-trained with large-scale
reinforcement learning to enhance multimodal reasoning and generation. To
improve inference efficiency, we propose Discrete Diffusion Adaptation (DiDA),
which converts token-by-token decoding into bidirectional parallel prediction,
accelerating per-image inference by about 20x without sacrificing performance.
Emu3.5 exhibits strong native multimodal capabilities, including long-horizon
vision-language generation, any-to-image (X2I) generation, and complex
text-rich image generation. It also exhibits generalizable world-modeling
abilities, enabling spatiotemporally consistent world exploration and
open-world embodied manipulation across diverse scenarios and tasks. For
comparison, Emu3.5 achieves performance comparable to Gemini 2.5 Flash Image
(Nano Banana) on image generation and editing tasks and demonstrates superior
results on a suite of interleaved generation tasks. We open-source Emu3.5 at
https://github.com/baaivision/Emu3.5 to support community research.
comment: project page: https://emu.world
☆ CATCH: A Modular Cross-domain Adaptive Template with Hook
Recent advances in Visual Question Answering (VQA) have demonstrated
impressive performance in natural image domains, with models like LLaVA
leveraging large language models (LLMs) for open-ended reasoning. However,
their generalization degrades significantly when transferred to out-of-domain
scenarios such as remote sensing, medical imaging, or math diagrams, due to
large distributional shifts and the lack of effective domain adaptation
mechanisms. Existing approaches typically rely on per-domain fine-tuning or
bespoke pipelines, which are costly, inflexible, and not scalable across
diverse tasks. In this paper, we propose CATCH, a plug-and-play framework for
cross-domain adaptation that improves the generalization of VQA models while
requiring minimal changes to their core architecture. Our key idea is to
decouple visual and linguistic adaptation by introducing two lightweight
modules: a domain classifier to identify the input image type, and a dual
adapter mechanism comprising a Prompt Adapter for language modulation and a
Visual Adapter for vision feature adjustment. Both modules are dynamically
injected via a unified hook interface, requiring no retraining of the backbone
model. Experimental results across four domain-specific VQA benchmarks
demonstrate that our framework achieves consistent performance gains without
retraining the backbone model, including +2.3 BLEU on MathVQA, +2.6 VQA on
MedVQA-RAD, and +3.1 ROUGE on ChartQA. These results highlight that CATCH
provides a scalable and extensible approach to multi-domain VQA, enabling
practical deployment across diverse application domains.
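The decoupled adaptation described above (a domain classifier routing inputs to lightweight prompt and visual adapters around a frozen backbone) can be sketched in a minimal form. This is a hypothetical re-implementation of the idea, not the authors' code; all class and function names are illustrative assumptions.

```python
class HookedVQA:
    """Minimal sketch of CATCH-style plug-and-play adaptation: the backbone
    stays frozen, and a hook point applies domain-specific adapters to the
    visual features and the prompt before the backbone runs."""

    def __init__(self, backbone, domain_classifier, adapters):
        self.backbone = backbone          # frozen model: (image_feat, prompt) -> answer
        self.classify = domain_classifier # image_feat -> domain name, e.g. "medical"
        self.adapters = adapters          # domain -> (visual_adapter, prompt_adapter)

    def answer(self, image_feat, prompt):
        domain = self.classify(image_feat)
        # Unknown domains fall through unchanged (identity adapters).
        vis_ad, prm_ad = self.adapters.get(domain, (lambda v: v, lambda p: p))
        # Hook point: adapt features and prompt; the backbone is never retrained.
        return self.backbone(vis_ad(image_feat), prm_ad(prompt))
```

With stub callables standing in for the backbone and classifier, a remote-sensing input gets its features rescaled and its prompt prefixed, while an in-domain input passes through untouched.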
☆ Dynamic Context-Aware Scene Reasoning Using Vision-Language Alignment in Zero-Shot Real-World Scenarios
In real-world environments, AI systems often face unfamiliar scenarios
without labeled data, creating a major challenge for conventional scene
understanding models. The inability to generalize across unseen contexts limits
the deployment of vision-based applications in dynamic, unstructured settings.
This work introduces a Dynamic Context-Aware Scene Reasoning framework that
leverages Vision-Language Alignment to address zero-shot real-world scenarios.
The goal is to enable intelligent systems to infer and adapt to new
environments without prior task-specific training. The proposed approach
integrates pre-trained vision transformers and large language models to align
visual semantics with natural language descriptions, enhancing contextual
comprehension. A dynamic reasoning module refines predictions by combining
global scene cues and object-level interactions guided by linguistic priors.
Extensive experiments on zero-shot benchmarks such as COCO, Visual Genome, and
Open Images demonstrate up to 18% improvement in scene understanding accuracy
over baseline models in complex and unseen environments. Results also show
robust performance in ambiguous or cluttered scenes due to the synergistic
fusion of vision and language. This framework offers a scalable and
interpretable approach for context-aware reasoning, advancing zero-shot
generalization in dynamic real-world settings.
comment: Preprint under review at IEEE Transactions on Pattern Analysis and
Machine Intelligence (TPAMI), 2025
☆ Comparative Analysis of Deep Learning Models for Olive Tree Crown and Shadow Segmentation Towards Biovolume Estimation
Wondimagegn Abebe Demissie, Stefano Roccella, Rudy Rossetto, Antonio Minnocci, Andrea Vannini, Luca Sebastiani
Olive tree biovolume estimation is a key task in precision agriculture,
supporting yield prediction and resource management, especially in
Mediterranean regions severely impacted by climate-induced stress. This study
presents a comparative analysis of three deep learning models, U-Net,
YOLOv11m-seg, and Mask R-CNN, for segmenting olive tree crowns and their shadows
in ultra-high resolution UAV imagery. The UAV dataset, acquired over
Vicopisano, Italy, includes manually annotated crown and shadow masks. Building
on these annotations, the methodology emphasizes spatial feature extraction and
robust segmentation; per-tree biovolume is then estimated by combining crown
projected area with shadow-derived height using solar geometry. In testing,
Mask R-CNN achieved the best overall accuracy (F1 = 0.86; mIoU = 0.72), while
YOLOv11m-seg provided the fastest throughput (0.12 seconds per image). The
estimated biovolumes spanned from approximately 4 to 24 cubic meters,
reflecting clear structural differences among trees. These results indicate
Mask R-CNN is preferable when biovolume accuracy is paramount, whereas
YOLOv11m-seg suits large-area deployments where speed is critical; U-Net
remains a lightweight, high-sensitivity option. The framework enables accurate,
scalable orchard monitoring and can be further strengthened with DEM or DSM
integration and field calibration for operational decision support.
comment: 6 pages, 2025 IEEE International Workshop on Metrology for
Agriculture and Forestry (MetroAgriFor)
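The shadow-geometry step in the abstract above (tree height from shadow length and solar elevation, then biovolume from crown projected area and height) can be sketched as follows. The height relation h = L * tan(elevation) follows from standard solar geometry; the 2/3 spheroid-like crown shape factor is an assumption of this sketch, not the paper's calibrated value.

```python
import math

def biovolume_from_crown_and_shadow(crown_area_m2, shadow_len_m, solar_elev_deg,
                                    shape_factor=2.0 / 3.0):
    """Estimate per-tree biovolume from UAV-derived measurements:
    height from shadow length and solar elevation angle, then
    volume ~ shape_factor * crown_projected_area * height."""
    height_m = shadow_len_m * math.tan(math.radians(solar_elev_deg))
    return shape_factor * crown_area_m2 * height_m
```

At a 45-degree solar elevation a 3 m shadow implies a 3 m tree, so a 9 m^2 crown gives roughly 18 m^3, inside the 4 to 24 m^3 range the study reports.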
☆ AdSum: Two-stream Audio-visual Summarization for Automated Video Advertisement Clipping
Advertisers commonly need multiple versions of the same advertisement (ad) at
varying durations for a single campaign. The traditional approach involves
manually selecting and re-editing shots from longer video ads to create shorter
versions, which is labor-intensive and time-consuming. In this paper, we
introduce a framework for automated video ad clipping using video summarization
techniques. We are the first to frame video clipping as a shot selection
problem, tailored specifically for advertising. Unlike existing general video
summarization methods that primarily focus on visual content, our approach
emphasizes the critical role of audio in advertising. To achieve this, we
develop a two-stream audio-visual fusion model that predicts the importance of
video frames, where importance is defined as the likelihood of a frame being
selected in the firm-produced short ad. To address the lack of ad-specific
datasets, we present AdSum204, a novel dataset comprising 102 pairs of
30-second and 15-second ads from real advertising campaigns. Extensive
experiments demonstrate that our model outperforms state-of-the-art methods
across various metrics, including Average Precision, Area Under Curve,
Spearman, and Kendall.
comment: Accepted at 32nd International Conference on MultiMedia Modeling
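Once the two-stream model has scored frame importance, the clipping step framed above as shot selection reduces to choosing shots that fill the shorter target duration. A greedy stand-in for whatever selection procedure the paper actually uses, shown here as an illustrative sketch:

```python
def select_shots(shot_importance, shot_durations, target_dur):
    """Pick shots greedily by importance (e.g. averaged per-frame scores
    from the audio-visual model) until the target ad duration is filled,
    then restore the original temporal order for the final cut."""
    order = sorted(range(len(shot_importance)),
                   key=lambda i: shot_importance[i], reverse=True)
    chosen, total = [], 0.0
    for i in order:
        if total + shot_durations[i] <= target_dur:
            chosen.append(i)
            total += shot_durations[i]
    return sorted(chosen)   # temporal order, not importance order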
☆ SA$^{2}$Net: Scale-Adaptive Structure-Affinity Transformation for Spine Segmentation from Ultrasound Volume Projection Imaging
Hao Xie, Zixun Huang, Yushen Zuo, Yakun Ju, Frank H. F. Leung, N. F. Law, Kin-Man Lam, Yong-Ping Zheng, Sai Ho Ling
Spine segmentation, based on ultrasound volume projection imaging (VPI),
plays a vital role for intelligent scoliosis diagnosis in clinical
applications. However, this task faces several significant challenges. Firstly,
the global contextual knowledge of spines may not be well-learned if we neglect
the high spatial correlation of different bone features. Secondly, the spine
bones contain rich structural knowledge regarding their shapes and positions,
which deserves to be encoded into the segmentation process. To address these
challenges, we propose a novel scale-adaptive structure-aware network
(SA$^{2}$Net) for effective spine segmentation. First, we propose a
scale-adaptive complementary strategy to learn the cross-dimensional
long-distance correlation features for spinal images. Second, motivated by the
consistency between multi-head self-attention in Transformers and semantic
level affinity, we propose structure-affinity transformation to transform
semantic features with class-specific affinity and combine it with a
Transformer decoder for structure-aware reasoning. In addition, we adopt a
feature mixing loss aggregation method to enhance model training. This method
improves the robustness and accuracy of the segmentation process. The
experimental results demonstrate that our SA$^{2}$Net achieves superior
segmentation performance compared to other state-of-the-art methods. Moreover,
the adaptability of SA$^{2}$Net to various backbones enhances its potential as
a promising tool for advanced scoliosis diagnosis using intelligent spinal
image analysis. The code and experimental demo are available at
https://github.com/taetiseo09/SA2Net.
comment: Accepted by Computerized Medical Imaging and Graphics (CMIG)
☆ Analysis of the Robustness of an Edge Detector Based on Cellular Automata Optimized by Particle Swarm
The edge detection task is essential in image processing aiming to extract
relevant information from an image. One recurring problem in this task is the
weaknesses found in some detectors, such as the difficulty in detecting loose
edges and the lack of context to extract relevant information from specific
problems. To address these weaknesses and adapt the detector to the properties
of an image, an adaptable detector described by a two-dimensional cellular
automaton and optimized by a meta-heuristic combined with transfer learning
techniques was developed. This study analyzes the impact of expanding
the search space of the optimization phase and the robustness of the
adaptability of the detector in identifying edges of a set of natural images
and specialized subsets extracted from the same image set. The results show
that expanding the search space of the optimization phase was not
effective for the chosen image set. The study also analyzed the adaptability of
the model through a series of experiments and validation techniques and found
that, regardless of the validation, the model was able to adapt to the input
and the transfer learning techniques applied to the model showed no significant
improvements.
☆ Counteracting Matthew Effect in Self-Improvement of LVLMs through Head-Tail Re-balancing
Xin Guo, Zhiheng Xi, Yiwen Ding, Yitao Zhai, Xiaowei Shi, Xunliang Cai, Tao Gui, Qi Zhang, Xuanjing Huang
Self-improvement has emerged as a mainstream paradigm for advancing the
reasoning capabilities of large vision-language models (LVLMs), where models
explore and learn from successful trajectories iteratively. However, we
identify a critical issue during this process: the model excels at generating
high-quality trajectories for simple queries (i.e., head data) but struggles
with more complex ones (i.e., tail data). This leads to an imbalanced
optimization that drives the model to prioritize simple reasoning skills, while
hindering its ability to tackle more complex reasoning tasks. Over iterations,
this imbalance becomes increasingly pronounced--a dynamic we term the "Matthew
effect"--which ultimately hinders further model improvement and leads to
performance bottlenecks. To counteract this challenge, we introduce four
efficient strategies from two perspectives: distribution-reshaping and
trajectory-resampling, to achieve head-tail re-balancing during the
exploration-and-learning self-improvement process. Extensive experiments on
Qwen2-VL-7B-Instruct and InternVL2.5-4B models across visual reasoning tasks
demonstrate that our methods consistently improve visual reasoning
capabilities, outperforming vanilla self-improvement by 3.86 points on average.
comment: Preprint
☆ Representation-Level Counterfactual Calibration for Debiased Zero-Shot Recognition
Object-context shortcuts remain a persistent challenge in vision-language
models, undermining zero-shot reliability when test-time scenes differ from
familiar training co-occurrences. We recast this issue as a causal inference
problem and ask: Would the prediction remain if the object appeared in a
different environment? To answer this at inference time, we estimate object and
background expectations within CLIP's representation space, and synthesize
counterfactual embeddings by recombining object features with diverse
alternative contexts sampled from external datasets, batch neighbors, or
text-derived descriptions. By estimating the Total Direct Effect and simulating
intervention, we further subtract background-only activation, preserving
beneficial object-context interactions while mitigating hallucinated scores.
Without retraining or prompt design, our method substantially improves both
worst-group and average accuracy on context-sensitive benchmarks, establishing
a new zero-shot state of the art. Beyond performance, our framework provides a
lightweight representation-level counterfactual approach, offering a practical
causal avenue for debiased and reliable multimodal reasoning.
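The inference-time recipe described above (synthesize counterfactual embeddings by recombining object features with alternative contexts, then subtract the background-only activation) can be sketched as follows. This is an assumed, simplified form of the estimator, not the authors' exact method; the additive object+context recombination and the `lam` weight are illustrative choices.

```python
import numpy as np

def _cos(a, b):
    """Row-wise cosine similarity between two sets of embeddings."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

def debiased_scores(obj_emb, bg_emb, alt_contexts, text_embs, lam=0.5):
    """Counterfactual calibration sketch: recombine the object embedding
    with alternative context embeddings, average the resulting class
    scores, and subtract the background-only score to remove
    object-context shortcuts."""
    cf = obj_emb[None, :] + alt_contexts          # counterfactual embeddings
    cf_score = _cos(cf, text_embs).mean(axis=0)   # average over contexts
    bg_score = _cos(bg_emb[None, :], text_embs)[0]
    return cf_score - lam * bg_score
```

On a toy example where the background direction aligns with the wrong class, the subtraction drives that class's score negative while the object-aligned class stays on top.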
☆ Towards Fine-Grained Vision-Language Alignment for Few-Shot Anomaly Detection
Few-shot anomaly detection (FSAD) methods identify anomalous regions with few
known normal samples. Most existing methods rely on the generalization ability
of pre-trained vision-language models (VLMs) to recognize potentially anomalous
regions through feature similarity between text descriptions and images.
However, due to the lack of detailed textual descriptions, these methods can
only pre-define image-level descriptions to match each visual patch token to
identify potential anomalous regions, which leads to the semantic misalignment
between image descriptions and patch-level visual anomalies, achieving
sub-optimal localization performance. To address the above issues, we propose
the Multi-Level Fine-Grained Semantic Caption (MFSC) to provide multi-level and
fine-grained textual descriptions for existing anomaly detection datasets with
automatic construction pipeline. Based on the MFSC, we propose a novel
framework named FineGrainedAD to improve anomaly localization performance,
which consists of two components: Multi-Level Learnable Prompt (MLLP) and
Multi-Level Semantic Alignment (MLSA). MLLP introduces fine-grained semantics
into multi-level learnable prompts through automatic replacement and
concatenation mechanism, while MLSA designs region aggregation strategy and
multi-level alignment training to facilitate learnable prompts better align
with corresponding visual regions. Experiments demonstrate that the proposed
FineGrainedAD achieves superior overall performance in few-shot settings on
MVTec-AD and VisA datasets.
comment: 12 pages, 7 figures
☆ PointSt3R: Point Tracking through 3D Grounded Correspondence
Recent advances in foundational 3D reconstruction models, such as DUSt3R and
MASt3R, have shown great potential in 2D and 3D correspondence in static
scenes. In this paper, we propose to adapt them for the task of point tracking
through 3D grounded correspondence. We first demonstrate that these models are
competitive point trackers when focusing on static points, present in current
point tracking benchmarks ($+33.5\%$ on EgoPoints vs. CoTracker2). We propose
to combine the reconstruction loss with training for dynamic correspondence
along with a visibility head, and fine-tuning MASt3R for point tracking using a
relatively small amount of synthetic data. Importantly, we only train and
evaluate on pairs of frames where one contains the query point, effectively
removing any temporal context. Using a mix of dynamic and static point
correspondences, we achieve competitive or superior point tracking results on
four datasets (e.g. competitive on TAP-Vid-DAVIS 73.8 $\delta_{avg}$ / 85.8\%
occlusion acc. for PointSt3R compared to 75.7 / 88.3\% for CoTracker2; and
significantly outperform CoTracker3 on EgoPoints 61.3 vs 54.2 and RGB-S 87.0 vs
82.8). We also present results on 3D point tracking along with several
ablations on training datasets and percentage of dynamic correspondences.
comment: http://rhodriguerrier.github.io/PointSt3R
☆ A-TPT: Angular Diversity Calibration Properties for Test-Time Prompt Tuning of Vision-Language Models
Test-time prompt tuning (TPT) has emerged as a promising technique for
adapting large vision-language models (VLMs) to unseen tasks without relying on
labeled data. However, the lack of dispersion between textual features can hurt
calibration performance, which raises concerns about VLMs' reliability,
trustworthiness, and safety. Current TPT approaches primarily focus on
improving prompt calibration by either maximizing average textual feature
dispersion or enforcing orthogonality constraints to encourage angular
separation. However, these methods may not always have optimal angular
separation between class-wise textual features, which implies overlooking the
critical role of angular diversity. To address this, we propose A-TPT, a novel
TPT framework that introduces angular diversity to encourage uniformity in the
distribution of normalized textual features induced by corresponding learnable
prompts. This uniformity is achieved by maximizing the minimum pairwise angular
distance between features on the unit hypersphere. We show that our approach
consistently surpasses state-of-the-art TPT methods in reducing the aggregate
average calibration error while maintaining comparable accuracy through
extensive experiments with various backbones on different datasets. Notably,
our approach exhibits superior zero-shot calibration performance on natural
distribution shifts and generalizes well to medical datasets. We provide
extensive analyses, including theoretical aspects, to establish the grounding
of A-TPT. These results highlight the potency of promoting angular diversity to
achieve well-dispersed textual features, significantly improving VLM
calibration during test-time adaptation. Our code will be made publicly
available.
comment: 23 pages, 14 figures
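The angular-diversity quantity at the heart of the abstract above, the minimum pairwise angle between normalized class text features on the unit hypersphere, is easy to state concretely. A-TPT-style tuning would maximize this value; the exact loss form in the paper may differ, so treat this as an illustrative sketch.

```python
import numpy as np

def min_pairwise_angle(features):
    """Return the minimum pairwise angle (radians) between feature vectors
    after projecting them onto the unit hypersphere. Larger values mean
    better-dispersed class text features."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    cos = np.clip(f @ f.T, -1.0, 1.0)      # guard arccos against rounding
    pairs = cos[np.triu_indices(len(f), k=1)]  # strict upper triangle: all pairs
    return np.arccos(pairs).min()
```

For features along [1,0], [0,1], and [1,1] the pairwise angles are 90, 45, and 45 degrees, so the minimum is pi/4; two antipodal features give the maximum possible value, pi.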
☆ LoCoT2V-Bench: A Benchmark for Long-Form and Complex Text-to-Video Generation
Recently text-to-video generation has made impressive progress in producing
short, high-quality clips, but evaluating long-form outputs remains a major
challenge especially when processing complex prompts. Existing benchmarks
mostly rely on simplified prompts and focus on low-level metrics, overlooking
fine-grained alignment with prompts and abstract dimensions such as narrative
coherence and thematic expression. To address these gaps, we propose
LoCoT2V-Bench, a benchmark specifically designed for long video generation
(LVG) under complex input conditions. Based on various real-world videos,
LoCoT2V-Bench introduces a suite of realistic and complex prompts incorporating
elements like scene transitions and event dynamics. Moreover, it constructs a
multi-dimensional evaluation framework that includes our newly proposed metrics
such as event-level alignment, fine-grained temporal consistency, content
clarity, and the Human Expectation Realization Degree (HERD) that focuses on
more abstract attributes like narrative flow, emotional response, and character
development. Using this framework, we conduct a comprehensive evaluation of
nine representative LVG models, finding that while current methods perform well
on basic visual and temporal aspects, they struggle with inter-event
consistency, fine-grained alignment, and high-level thematic adherence.
Overall, LoCoT2V-Bench provides a comprehensive and reliable platform for
evaluating long-form complex text-to-video generation and highlights critical
directions for future method improvement.
☆ EEG-Driven Image Reconstruction with Saliency-Guided Diffusion Models
Existing EEG-driven image reconstruction methods often overlook spatial
attention mechanisms, limiting fidelity and semantic coherence. To address
this, we propose a dual-conditioning framework that combines EEG embeddings
with spatial saliency maps to enhance image generation. Our approach leverages
the Adaptive Thinking Mapper (ATM) for EEG feature extraction and fine-tunes
Stable Diffusion 2.1 via Low-Rank Adaptation (LoRA) to align neural signals
with visual semantics, while a ControlNet branch conditions generation on
saliency maps for spatial control. Evaluated on THINGS-EEG, our method achieves
a significant improvement in the quality of low- and high-level image features
over existing approaches, while aligning strongly with human visual
attention. The results demonstrate that attentional priors resolve EEG
ambiguities, enabling high-fidelity reconstructions with applications in
medical diagnostics and neuroadaptive interfaces, advancing neural decoding
through efficient adaptation of pre-trained diffusion models.
comment: Demo paper
☆ SPG-CDENet: Spatial Prior-Guided Cross Dual Encoder Network for Multi-Organ Segmentation
Multi-organ segmentation is a critical task in computer-aided diagnosis.
While recent deep learning methods have achieved remarkable success in image
segmentation, huge variations in organ size and shape challenge their
effectiveness in multi-organ segmentation. To address these challenges, we
propose a Spatial Prior-Guided Cross Dual Encoder Network (SPG-CDENet), a novel
two-stage segmentation paradigm designed to improve multi-organ segmentation
accuracy. Our SPG-CDENet consists of two key components: a spatial prior
network and a cross dual encoder network. The prior network generates coarse
localization maps that delineate the approximate ROI, serving as spatial
guidance for the dual encoder network. The cross dual encoder network comprises
four essential components: a global encoder, a local encoder, a symmetric
cross-attention module, and a flow-based decoder. The global encoder captures
global semantic features from the entire image, while the local encoder focuses
on features from the prior network. To enhance the interaction between the
global and local encoders, a symmetric cross-attention module is proposed
across all layers of the encoders to fuse and refine features. Furthermore, the
flow-based decoder directly propagates high-level semantic features from the
final encoder layer to all decoder layers, maximizing feature preservation and
utilization. Extensive qualitative and quantitative experiments on two public
datasets demonstrate the superior performance of SPG-CDENet compared to
existing segmentation methods. Furthermore, ablation studies further validate
the effectiveness of the proposed modules in improving segmentation accuracy.
☆ CorVS: Person Identification via Video Trajectory-Sensor Correspondence in a Real-World Warehouse
Worker location data is key to higher productivity in industrial sites.
Cameras are a promising tool for localization in logistics warehouses since
they also offer valuable environmental contexts such as package status.
However, identifying individuals with only visual data is often impractical.
Accordingly, several prior studies identified people in videos by comparing
their trajectories and wearable sensor measurements. While this approach has
advantages such as independence from appearance, the existing methods may break
down under real-world conditions. To overcome this challenge, we propose CorVS,
a novel data-driven person identification method based on correspondence
between visual tracking trajectories and sensor measurements. Firstly, our deep
learning model predicts correspondence probabilities and reliabilities for
every trajectory-sensor measurement pair. Secondly, our algorithm
matches the trajectories and sensor measurements over time using the predicted
probabilities and reliabilities. We developed a dataset with actual warehouse
operations and demonstrated the method's effectiveness for real-world
applications.
comment: 7 pages, 3 figures, accepted to IPIN 2025
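The second step above, matching trajectories to sensor streams from predicted probabilities and reliabilities, can be sketched with a greedy one-to-one assignment. This is a stand-in for whatever matching algorithm the paper uses; the reliability-weighted score and the threshold are assumptions of this sketch.

```python
def match_trajectories(prob, rel, threshold=0.5):
    """Greedy one-to-one matching: prob[i][j] is the predicted
    correspondence probability between trajectory i and sensor stream j,
    rel[i][j] its reliability. Pairs are taken in order of
    reliability-weighted probability; low-confidence pairs are left
    unmatched."""
    scored = sorted(((prob[i][j] * rel[i][j], i, j)
                     for i in range(len(prob)) for j in range(len(prob[0]))),
                    reverse=True)
    used_i, used_j, pairs = set(), set(), []
    for s, i, j in scored:
        if s >= threshold and i not in used_i and j not in used_j:
            pairs.append((i, j))
            used_i.add(i)
            used_j.add(j)
    return sorted(pairs)
```

With two trajectories and two sensors whose strongest scores lie on the diagonal, the greedy pass recovers the identity assignment and rejects any pair below the threshold.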
☆ AgriGS-SLAM: Orchard Mapping Across Seasons via Multi-View Gaussian Splatting SLAM
Autonomous robots in orchards require real-time 3D scene understanding
despite repetitive row geometry, seasonal appearance changes, and wind-driven
foliage motion. We present AgriGS-SLAM, a Visual--LiDAR SLAM framework that
couples direct LiDAR odometry and loop closures with multi-camera 3D Gaussian
Splatting (3DGS) rendering. Batch rasterization across complementary viewpoints
recovers orchard structure under occlusions, while a unified gradient-driven
map lifecycle executed between keyframes preserves fine details and bounds
memory. Pose refinement is guided by a probabilistic LiDAR-based depth
consistency term, back-propagated through the camera projection to tighten
geometry-appearance coupling. We deploy the system on a field platform in apple
and pear orchards across dormancy, flowering, and harvesting, using a
standardized trajectory protocol that evaluates both training-view and
novel-view synthesis to reduce 3DGS overfitting in evaluation. Across seasons
and sites, AgriGS-SLAM delivers sharper, more stable reconstructions and
steadier trajectories than recent state-of-the-art 3DGS-SLAM baselines while
maintaining real-time performance on-tractor. While demonstrated in orchard
monitoring, the approach can be applied to other outdoor domains requiring
robust multimodal perception.
☆ GLYPH-SR: Can We Achieve Both High-Quality Image Super-Resolution and High-Fidelity Text Recovery via VLM-guided Latent Diffusion Model? ICLR 2026
Image super-resolution (SR) is fundamental to many vision systems, from
surveillance and autonomy to document analysis and retail analytics, because
recovering high-frequency details, especially scene-text, enables reliable
downstream perception. Scene-text, i.e., text embedded in natural images such
as signs, product labels, and storefronts, often carries the most actionable
information; when characters are blurred or hallucinated, optical character
recognition (OCR) and subsequent decisions fail even if the rest of the image
appears sharp. Yet previous SR research has often been tuned to distortion
(PSNR/SSIM) or learned perceptual metrics (LPIPS, MANIQA, CLIP-IQA, MUSIQ) that
are largely insensitive to character-level errors. Furthermore, studies that do
address text SR often focus on simplified benchmarks with isolated characters,
overlooking the challenges of text within complex natural scenes. As a result,
scene-text is effectively treated as generic texture. For SR to be effective in
practical deployments, it is therefore essential to explicitly optimize for
both text legibility and perceptual quality. We present GLYPH-SR, a
vision-language-guided diffusion framework that aims to achieve both objectives
jointly. GLYPH-SR utilizes a Text-SR Fusion ControlNet(TS-ControlNet) guided by
OCR data, and a ping-pong scheduler that alternates between text- and
scene-centric guidance. To enable targeted text restoration, we train these
components on a synthetic corpus while keeping the main SR branch frozen.
Across SVT, SCUT-CTW1500, and CUTE80 at x4 and x8, GLYPH-SR improves OCR F1 by
up to +15.18 percentage points over diffusion/GAN baselines (SVT x8, OpenOCR)
while maintaining competitive MANIQA, CLIP-IQA, and MUSIQ. GLYPH-SR is designed
to satisfy both objectives simultaneously, high readability and high visual
realism, delivering SR that looks right and reads right.
comment: 11 pages, 6 figures. Includes supplementary material. Under review as
a conference paper at ICLR 2026
☆ A Hybrid Framework Bridging CNN and ViT based on Theory of Evidence for Diabetic Retinopathy Grading
Diabetic retinopathy (DR) is a leading cause of vision loss among middle-aged
and elderly people, which significantly impacts their daily lives and mental
health. To improve the efficiency of clinical screening and enable the early
detection of DR, a variety of automated DR diagnosis systems have been recently
established based on convolutional neural network (CNN) or vision Transformer
(ViT). However, due to the inherent shortcomings of CNNs and ViTs, the
performance of existing methods built on a single type of backbone has reached
a bottleneck. One potential route to further improvement is integrating
different kinds of backbones to fully leverage their respective strengths
(i.e., the local feature extraction capability of CNNs and the global
feature capturing ability of ViTs). To this end, we propose a novel paradigm to
effectively fuse the features extracted by different backbones based on the
theory of evidence. Specifically, the proposed evidential fusion paradigm
transforms the features from different backbones into supporting evidence via
a set of deep evidential networks. From this evidence, an aggregated opinion
is formed, which is used to adaptively tune the fusion pattern between the
backbones and thereby boost the performance of the hybrid model. We evaluated
our method on two publicly
available DR grading datasets. The experimental results demonstrate that our
hybrid model not only improves the accuracy of DR grading compared to
state-of-the-art frameworks, but also provides excellent interpretability for
feature fusion and decision-making.
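Evidential fusion of this kind typically follows the subjective-logic pattern: each backbone's evidence induces a Dirichlet opinion (per-class belief plus uncertainty), and opinions are merged with a reduced Dempster combination. A minimal sketch under that assumption, with made-up two-class evidence; the paper's exact formulation may differ:

```python
def opinion(evidence):
    """Evidence -> subjective-logic opinion: per-class belief plus uncertainty.

    Dirichlet strength S = sum(e_k) + K (alpha_k = e_k + 1); belief b_k = e_k / S,
    uncertainty u = K / S, so sum(b) + u == 1.
    """
    K = len(evidence)
    S = sum(evidence) + K
    return [e / S for e in evidence], K / S

def combine(o1, o2):
    """Reduced Dempster combination of two opinions into an aggregated opinion."""
    (b1, u1), (b2, u2) = o1, o2
    K = len(b1)
    conflict = sum(b1[i] * b2[j] for i in range(K) for j in range(K) if i != j)
    norm = 1.0 - conflict
    b = [(b1[k] * b2[k] + b1[k] * u2 + b2[k] * u1) / norm for k in range(K)]
    return b, u1 * u2 / norm

# toy two-class evidence from a CNN branch and a ViT branch (hypothetical numbers)
cnn_op = opinion([9.0, 1.0])
vit_op = opinion([8.0, 2.0])
fused_b, fused_u = combine(cnn_op, vit_op)
```

Because agreeing opinions reinforce each other, the fused uncertainty drops below either backbone's own, which is what allows the fusion pattern to adapt per sample.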
☆ Exploring the correlation between the type of music and the emotions evoked: A study using subjective questionnaires and EEG
This work examines how different types of music affect human emotions. While
participants listened to music, a subjective survey and brain activity
measurements were carried out using an EEG helmet. The aim is to demonstrate
the impact of different music genres on emotions. The study involved a diverse
group of participants of different genders and musical preferences, which
allowed a wide range of emotional responses to music to be captured. After the
experiment, the respondents' questionnaires were analysed in relation to the
EEG signals, and the analysis revealed connections between reported emotions
and observed brain activity.
comment: Published at IWAIPR 2025 conference
☆ Towards Realistic Earth-Observation Constellation Scheduling: Benchmark and Methodology
Agile Earth Observation Satellite (AEOS) constellations offer unprecedented
flexibility for monitoring the Earth's surface, but their scheduling remains
challenging under large-scale scenarios, dynamic environments, and stringent
constraints. Existing methods often simplify these complexities, limiting their
real-world performance. We address this gap with a unified framework
integrating a standardized benchmark suite and a novel scheduling model. Our
benchmark suite, AEOS-Bench, contains $3,907$ finely tuned satellite assets and
$16,410$ scenarios. Each scenario features $1$ to $50$ satellites and $50$ to
$300$ imaging tasks. These scenarios are generated via a high-fidelity
simulation platform, ensuring realistic satellite behavior such as orbital
dynamics and resource constraints. Ground truth scheduling annotations are
provided for each scenario. To our knowledge, AEOS-Bench is the first
large-scale benchmark suite tailored for realistic constellation scheduling.
Building upon this benchmark, we introduce AEOS-Former, a Transformer-based
scheduling model that incorporates a constraint-aware attention mechanism. A
dedicated internal constraint module explicitly models the physical and
operational limits of each satellite. Through simulation-based iterative
learning, AEOS-Former adapts to diverse scenarios, offering a robust solution
for AEOS constellation scheduling. Experimental results demonstrate that
AEOS-Former outperforms baseline models in task completion and energy
efficiency, with ablation studies highlighting the contribution of each
component. Code and data are provided in
https://github.com/buaa-colalab/AEOSBench.
☆ Leveraging Large-Scale Face Datasets for Deep Periocular Recognition via Ocular Cropping
We focus on ocular biometrics, specifically the periocular region (the area
around the eye), which offers high discrimination and minimal acquisition
constraints. We evaluate three Convolutional Neural Network architectures of
varying depth and complexity to assess their effectiveness for periocular
recognition. The networks are trained on 1,907,572 ocular crops extracted from
the large-scale VGGFace2 database. This contrasts sharply with existing works,
which typically train on small-scale periocular datasets of only a few
thousand images. Experiments are conducted with ocular images
from VGGFace2-Pose, a subset of VGGFace2 containing in-the-wild face images,
and the UFPR-Periocular database, which consists of selfies captured via mobile
devices with user guidance on the screen. Due to the uncontrolled conditions of
VGGFace2, the Equal Error Rates (EERs) obtained with ocular crops range from
9-15%, noticeably higher than the 3-6% EERs achieved using full-face images. In
contrast, UFPR-Periocular yields significantly better performance (EERs of
1-2%), thanks to higher image quality and more consistent acquisition
protocols. To the best of our knowledge, these are the lowest reported EERs on
the UFPR dataset to date.
comment: Published at IWAIPR 2025 conference
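The Equal Error Rate reported above is the operating point where the false accept rate on impostor comparisons equals the false reject rate on genuine comparisons. A minimal stdlib sketch of how it is computed from similarity scores (illustrative, not the authors' evaluation code):

```python
def equal_error_rate(genuine, impostor):
    """Sweep thresholds over similarity scores (higher = more likely genuine)
    and return (EER, threshold) at the point where FAR is closest to FRR."""
    best_gap, eer, thr = float("inf"), None, None
    for t in sorted(set(genuine + impostor)):
        far = sum(s >= t for s in impostor) / len(impostor)  # impostors accepted
        frr = sum(s < t for s in genuine) / len(genuine)     # genuine rejected
        if abs(far - frr) < best_gap:
            best_gap, eer, thr = abs(far - frr), (far + frr) / 2, t
    return eer, thr
```

On finite score sets FAR and FRR rarely cross exactly, so the EER is taken as the midpoint at the threshold that minimizes their gap.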
☆ Beyond Imitation: Constraint-Aware Trajectory Generation with Flow Matching For End-to-End Autonomous Driving
Planning is a critical component of end-to-end autonomous driving. However,
prevailing imitation learning methods often suffer from mode collapse, failing
to produce diverse trajectory hypotheses. Meanwhile, existing generative
approaches struggle to incorporate crucial safety and physical constraints
directly into the generative process, necessitating an additional optimization
stage to refine their outputs. To address these limitations, we propose CATG, a
novel planning framework that leverages Constrained Flow Matching. Concretely,
CATG explicitly models the flow matching process, which inherently mitigates
mode collapse and allows for flexible guidance from various conditioning
signals. Our primary contribution is the novel imposition of explicit
constraints directly within the flow matching process, ensuring that the
generated trajectories adhere to vital safety and kinematic rules. Secondly,
CATG parameterizes driving aggressiveness as a control signal during
generation, enabling precise manipulation of trajectory style. Notably, on the
NavSim v2 challenge, CATG achieved 2nd place with an EPDMS score of 51.31 and
was honored with the Innovation Award.
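The core idea above, imposing constraints inside the generative process rather than in a post-hoc optimization stage, can be illustrated with a toy Euler sampler that projects every intermediate trajectory onto a kinematic constraint set. The linear velocity field and all names here are illustrative stand-ins, not CATG's actual model:

```python
def clamp_step(prev, cur, max_step):
    """Kinematic constraint: cap the displacement between consecutive waypoints."""
    dx, dy = cur[0] - prev[0], cur[1] - prev[1]
    d = (dx * dx + dy * dy) ** 0.5
    if d <= max_step:
        return cur
    s = max_step / d
    return (prev[0] + dx * s, prev[1] + dy * s)

def sample_trajectory(velocity, x0, steps=10, max_step=1.0):
    """Euler-integrate a flow-matching velocity field, projecting every
    intermediate state onto the constraint set (constraints are enforced
    during generation, not in a separate refinement stage)."""
    x, dt = list(x0), 1.0 / steps
    for i in range(steps):
        t = i * dt
        v = velocity(x, t)
        x = [(p[0] + dt * vi[0], p[1] + dt * vi[1]) for p, vi in zip(x, v)]
        for k in range(1, len(x)):        # project onto the kinematic constraint
            x[k] = clamp_step(x[k - 1], x[k], max_step)
    return x

# toy linear velocity field pulling waypoints toward a straight-line target
targets = [(0.0, 0.0), (2.0, 0.0), (4.0, 0.0)]
def toward_targets(x, t):
    return [(tx - px, ty - py) for (px, py), (tx, ty) in zip(x, targets)]

traj = sample_trajectory(toward_targets, [(0.0, 0.0)] * 3)
```

Because projection happens at every integration step, every sample satisfies the constraint by construction, which is the advantage over refining an unconstrained sample afterwards.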
☆ Exploring Complementarity and Explainability in CNNs for Periocular Verification Across Acquisition Distances
We study the complementarity of different CNNs for periocular verification at
different distances on the UBIPr database. We train three architectures of
increasing complexity (SqueezeNet, MobileNetv2, and ResNet50) on a large set of
eye crops from VGGFace2. We analyse performance with cosine and chi-squared metrics,
compare different network initialisations, and apply score-level fusion via
logistic regression. In addition, we use LIME heatmaps and Jensen-Shannon
divergence to compare attention patterns of the CNNs. While ResNet50
consistently performs best individually, the fusion provides substantial gains,
especially when combining all three networks. Heatmaps show that networks
usually focus on distinct regions of a given image, which explains their
complementarity. Our method significantly outperforms previous works on UBIPr,
achieving a new state-of-the-art.
comment: Accepted at BIOSIG 2025 conference
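Score-level fusion via logistic regression, as used above, learns one weight per network over its verification scores. A self-contained sketch with a toy gradient-descent fit; the score values and training loop are illustrative (the real system would fit on genuine/impostor comparison scores):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_fusion(score_rows, labels, lr=0.5, epochs=500):
    """Fit logistic-regression fusion weights over per-network similarity
    scores with plain stochastic gradient descent on the log loss."""
    w, b = [0.0] * len(score_rows[0]), 0.0
    for _ in range(epochs):
        for scores, y in zip(score_rows, labels):
            p = sigmoid(sum(wi * si for wi, si in zip(w, scores)) + b)
            g = p - y                      # dLoss/dLogit for the log loss
            w = [wi - lr * g * si for wi, si in zip(w, scores)]
            b -= lr * g
    return w, b

def fuse(scores, w, b):
    """Fused verification score for one comparison."""
    return sigmoid(sum(wi * si for wi, si in zip(w, scores)) + b)

# toy per-pair scores from three networks; label 1 = genuine, 0 = impostor
rows = [[0.8, 0.7, 0.9], [0.75, 0.8, 0.85], [0.3, 0.4, 0.2], [0.35, 0.3, 0.25]]
w, b = fit_fusion(rows, [1, 1, 0, 0])
```

The learned weights effectively rank the networks by reliability, which is why fusion helps most when, as the heatmaps show, the networks attend to complementary regions.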
☆ Revisiting Generative Infrared and Visible Image Fusion Based on Human Cognitive Laws NeurIPS 2025
Existing infrared and visible image fusion methods often face the dilemma of
balancing modal information. Generative fusion methods reconstruct fused images
by learning from data distributions, but their generative capabilities remain
limited. Moreover, the lack of interpretability in modal information selection
further affects the reliability and consistency of fusion results in complex
scenarios. This manuscript revisits the essence of generative image fusion
under the inspiration of human cognitive laws and proposes a novel infrared and
visible image fusion method, termed HCLFuse. First, HCLFuse investigates the
quantification theory of information mapping in unsupervised fusion networks,
which leads to the design of a multi-scale mask-regulated variational
bottleneck encoder. This encoder applies posterior probability modeling and
information decomposition to extract accurate and concise low-level modal
information, thereby supporting the generation of high-fidelity structural
details. Furthermore, the probabilistic generative capability of the diffusion
model is integrated with physical laws, forming a time-varying physical
guidance mechanism that adaptively regulates the generation process at
different stages, thereby enhancing the ability of the model to perceive the
intrinsic structure of data and reducing dependence on data quality.
Experimental results show that the proposed method achieves state-of-the-art
fusion performance in qualitative and quantitative evaluations across multiple
datasets and significantly improves semantic segmentation metrics. This fully
demonstrates the advantages of this generative image fusion method, drawing
inspiration from human cognition, in enhancing structural consistency and
detail quality.
comment: NeurIPS 2025 spotlight
☆ Which Way Does Time Flow? A Psychophysics-Grounded Evaluation for Vision-Language Models
Modern vision-language models (VLMs) excel at many multimodal tasks, yet
their grasp of temporal information in video remains weak and, crucially,
under-evaluated. We probe this gap with a deceptively simple but revealing
challenge: judging the arrow of time (AoT), i.e., whether a short clip is played
forward or backward. We introduce AoT-PsyPhyBENCH, a psychophysically validated
benchmark that tests whether VLMs can infer temporal direction in natural
videos using the same stimuli and behavioral baselines established for humans.
Our comprehensive evaluation of open-weight and proprietary, reasoning and
non-reasoning VLMs reveals that most models perform near chance, and even the
best lag far behind human accuracy on physically irreversible processes (e.g.,
free fall, diffusion/explosion) and causal manual actions (division/addition)
that humans recognize almost instantly. These results highlight a fundamental
gap in current multimodal systems: while they capture rich visual-semantic
correlations, they lack the inductive biases required for temporal continuity
and causal understanding. We release the code and data for AoT-PsyPhyBENCH to
encourage further progress in the physical and temporal reasoning capabilities
of VLMs.
comment: 10 pages
☆ OmniLayout: Enabling Coarse-to-Fine Learning with LLMs for Universal Document Layout Generation
Document AI has advanced rapidly and is attracting increasing attention. Yet,
while most efforts have focused on document layout analysis (DLA), its
generative counterpart, document layout generation, remains underexplored. A
major obstacle lies in the scarcity of diverse layouts: academic papers with
Manhattan-style structures dominate existing studies, while open-world genres
such as newspapers and magazines remain severely underrepresented. To address
this gap, we curate OmniLayout-1M, the first million-scale dataset of diverse
document layouts, covering six common document types and comprising
contemporary layouts collected from multiple sources. Moreover, since existing
methods struggle in complex domains and often fail to arrange long sequences
coherently, we introduce OmniLayout-LLM, a 0.5B model with a two-stage
Coarse-to-Fine learning paradigm: 1) learning universal layout principles from
OmniLayout-1M with coarse category definitions, and 2) transferring the
knowledge to a specific domain with fine-grained annotations. Extensive
experiments demonstrate that our approach achieves strong performance on
multiple domains of the M$^{6}$Doc dataset, substantially surpassing both existing
layout generation experts and several latest general-purpose LLMs. Our code,
models, and dataset will be publicly released.
comment: TL;DR: With OmniLayout-1M dataset and LLM-based coarse-to-fine
learning, we enable universal and diverse document layout generation
☆ Developing a Multi-task Ensemble Geometric Deep Network for Supply Chain Sustainability and Risk Management
Supply chain sustainability plays a key role in achieving optimal performance
in supply chain control. Managing the risks that arise in a supply chain is
fundamental to developing the sustainability of the network and raising its
operational efficiency, and the correct classification of products is another
essential element of a sustainable supply chain. Building on recent
breakthroughs in deep networks, several architectures have been deployed to
analyze supply chain datasets. Here, a novel geometric deep network is used to
construct an ensemble deep network. The proposed Chebyshev ensemble geometric
network (Ch-EGN) is a hybrid of convolutional and geometric deep learning. It
leverages the information dependencies in the supply chain to infer unobserved
states of samples in the database. The proposed network is assessed on two
databases, the SupplyGraph dataset and DataCo. Delivery status prediction on
the DataCo supply chain is performed for risk management, while product
classification and edge classification are performed on the SupplyGraph
database to enhance the sustainability of the supply network. An
average accuracy of 98.95% is obtained with the ensemble network for risk
management. Average accuracies of 100% and 98.07% are obtained for the
sustainable supply chain tasks of 5-class product group classification and
4-class product relation classification, respectively, and an average accuracy
of 92.37% is attained for 25-class company relation classification. The
results confirm the improved accuracy and efficiency of the proposed method
compared to state-of-the-art approaches.
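The Chebyshev ("Ch") graph layer that gives the network its name expands node features in a polynomial basis of the rescaled graph Laplacian via the recurrence T_k = 2 L~ T_{k-1} - T_{k-2}; a learned layer is then a weighted sum of these basis features. A small pure-Python sketch of the recurrence (illustrative, not the paper's implementation):

```python
def matvec(M, x):
    return [sum(mij * xj for mij, xj in zip(row, x)) for row in M]

def cheb_features(L_scaled, x, K):
    """Chebyshev basis features T_k(L~) x via T_k = 2 L~ T_{k-1} - T_{k-2}.

    L_scaled is the rescaled graph Laplacian (eigenvalues in [-1, 1]); a
    Chebyshev graph convolution is a learned weighted sum of these features.
    """
    feats = [list(x)]                       # T_0(L~) x = x
    if K > 1:
        feats.append(matvec(L_scaled, x))   # T_1(L~) x = L~ x
    for _ in range(2, K):
        prev, prev2 = feats[-1], feats[-2]
        feats.append([2 * a - b for a, b in zip(matvec(L_scaled, prev), prev2)])
    return feats

# rescaled Laplacian of a two-node graph: L~ = 2 L_norm / lambda_max - I = [[0, -1], [-1, 0]]
L_tilde = [[0.0, -1.0], [-1.0, 0.0]]
feats = cheb_features(L_tilde, [1.0, 0.0], 3)
```

Each T_k aggregates information from nodes up to k hops away, which is how the layer captures the supply chain's relational dependencies without explicit eigendecomposition.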
☆ Sketch2PoseNet: Efficient and Generalized Sketch to 3D Human Pose Prediction SIGGRAPH
3D human pose estimation from sketches has broad applications in computer
animation and film production. Unlike traditional human pose estimation, this
task presents unique challenges due to the abstract and disproportionate nature
of sketches. Previous sketch-to-pose methods, constrained by the lack of
large-scale sketch-3D pose annotations, primarily relied on optimization with
heuristic rules-an approach that is both time-consuming and limited in
generalizability. To address these challenges, we propose a novel approach
leveraging a "learn from synthesis" strategy. First, a diffusion model is
trained to synthesize sketch images from 2D poses projected from 3D human
poses, mimicking disproportionate human structures in sketches. This process
enables the creation of a synthetic dataset, SKEP-120K, consisting of 120k
accurate sketch-3D pose annotation pairs across various sketch styles. Building
on this synthetic dataset, we introduce an end-to-end data-driven framework for
estimating human poses and shapes from diverse sketch styles. Our framework
combines existing 2D pose detectors and generative diffusion priors for sketch
feature extraction with a feed-forward neural network for efficient 2D pose
estimation. Multiple heuristic loss functions are incorporated to guarantee
geometric coherence between the derived 3D poses and the detected 2D poses
while preserving accurate self-contacts. Qualitative, quantitative, and
subjective evaluations collectively show that our model substantially surpasses
previous ones in both estimation accuracy and speed for sketch-to-pose tasks.
comment: SIGGRAPH Asia 2025
☆ ConceptScope: Characterizing Dataset Bias via Disentangled Visual Concepts NeurIPS 2025
Dataset bias, where data points are skewed to certain concepts, is ubiquitous
in machine learning datasets. Yet, systematically identifying these biases is
challenging without costly, fine-grained attribute annotations. We present
ConceptScope, a scalable and automated framework for analyzing visual datasets
by discovering and quantifying human-interpretable concepts using Sparse
Autoencoders trained on representations from vision foundation models.
ConceptScope categorizes concepts into target, context, and bias types based on
their semantic relevance and statistical correlation to class labels, enabling
class-level dataset characterization, bias identification, and robustness
evaluation through concept-based subgrouping. We validate that ConceptScope
captures a wide range of visual concepts, including objects, textures,
backgrounds, facial attributes, emotions, and actions, through comparisons with
annotated datasets. Furthermore, we show that concept activations produce
spatial attributions that align with semantically meaningful image regions.
ConceptScope reliably detects known biases (e.g., background bias in
Waterbirds) and uncovers previously unannotated ones (e.g., co-occurring objects
in ImageNet), offering a practical tool for dataset auditing and model
diagnostics.
comment: Published in the Thirty-Ninth Conference on Neural Information
Processing Systems (NeurIPS 2025)
☆ MoTDiff: High-resolution Motion Trajectory estimation from a single blurred image using Diffusion models
Accurate estimation of motion information is crucial in diverse computational
imaging and computer vision applications. Researchers have investigated various
methods to extract motion information from a single blurred image, including
blur kernels and optical flow. However, existing motion representations are
often of low quality, i.e., coarse-grained and inaccurate. In this paper, we
propose the first high-resolution (HR) Motion Trajectory estimation framework
using Diffusion models (MoTDiff). Different from existing motion
representations, we aim to estimate an HR motion trajectory with high-quality
from a single motion-blurred image. The proposed MoTDiff consists of two key
components: 1) a new conditional diffusion framework that uses multi-scale
feature maps extracted from a single blurred image as a condition, and 2) a new
training method that can promote precise identification of a fine-grained
motion trajectory, consistent estimation of overall shape and position of a
motion path, and pixel connectivity along a motion trajectory. Our experiments
demonstrate that the proposed MoTDiff can outperform state-of-the-art methods
in both blind image deblurring and coded exposure photography applications.
comment: 10 pages, 6 figures
☆ Self-localization on a 3D map by fusing global and local features from a monocular camera
Self-localization on a 3D map using an inexpensive monocular camera is
required to realize autonomous driving. Camera-based self-localization often
uses a convolutional neural network (CNN), which extracts local features
computed from nearby pixels. However, when dynamic obstacles such as people
are present, CNNs do not work well. This study proposes a new method combining
a CNN with a Vision Transformer, which excels at extracting global features
that capture the relationships among patches across the whole image. Experimental
results showed that, compared to the state-of-the-art method (SOTA), the
accuracy improvement rate in a CG dataset with dynamic obstacles is 1.5 times
higher than that without dynamic obstacles. Moreover, the self-localization
error of our method is 20.1% smaller than that of SOTA on public datasets.
Additionally, a robot using our method can localize itself with 7.51 cm error
on average, which is more accurate than SOTA.
☆ CRAG-MM: Multi-modal Multi-turn Comprehensive RAG Benchmark
Jiaqi Wang, Xiao Yang, Kai Sun, Parth Suresh, Sanat Sharma, Adam Czyzewski, Derek Andersen, Surya Appini, Arkav Banerjee, Sajal Choudhary, Shervin Ghasemlou, Ziqiang Guan, Akil Iyer, Haidar Khan, Lingkun Kong, Roy Luo, Tiffany Ma, Zhen Qiao, David Tran, Wenfang Xu, Skyler Yeatman, Chen Zhou, Gunveer Gujral, Yinglong Xia, Shane Moon, Nicolas Scheffer, Nirav Shah, Eun Chang, Yue Liu, Florian Metze, Tammy Stark, Zhaleh Feizollahi, Andrea Jessee, Mangesh Pujari, Ahmed Aly, Babak Damavandi, Rakesh Wanga, Anuj Kumar, Rohit Patel, Wen-tau Yih, Xin Luna Dong
Wearable devices such as smart glasses are transforming the way people
interact with their surroundings, enabling users to seek information regarding
entities in their view. Multi-Modal Retrieval-Augmented Generation (MM-RAG)
plays a key role in supporting such questions, yet there is still no
comprehensive benchmark for this task, especially regarding wearables
scenarios. To fill this gap, we present CRAG-MM -- a Comprehensive RAG
benchmark for Multi-modal Multi-turn conversations. CRAG-MM contains a diverse
set of 6.5K (image, question, answer) triplets and 2K visual-based multi-turn
conversations across 13 domains, including 6.2K egocentric images designed to
mimic captures from wearable devices. We carefully constructed the questions to
reflect real-world scenarios and challenges, including five types of
image-quality issues, six question types, varying entity popularity, differing
information dynamism, and different conversation turns. We design three tasks:
single-source augmentation, multi-source augmentation, and multi-turn
conversations -- each paired with an associated retrieval corpus and APIs for
both image-KG retrieval and webpage retrieval. Our evaluation shows that
straightforward RAG approaches achieve only 32% and 43% truthfulness on CRAG-MM
single- and multi-turn QA, respectively, whereas state-of-the-art industry
solutions have similar quality (32%/45%), underscoring ample room for
improvement. The benchmark has hosted KDD Cup 2025, attracting about 1K
participants and 5K submissions, with winning solutions improving baseline
performance by 28%, highlighting its early impact on advancing the field.
☆ Detecting Unauthorized Vehicles using Deep Learning for Smart Cities: A Case Study on Bangladesh
Sudipto Das Sukanto, Diponker Roy, Fahim Shakil, Nirjhar Singha, Abdullah Asik, Aniket Joarder, Mridha Md Nafis Fuad, Muhammad Ibrahim
Modes of transportation vary across countries depending on geographical
location and cultural context. In South Asian countries, rickshaws are among the
most common means of local transport. Based on their mode of operation,
rickshaws in cities across Bangladesh can be broadly classified into non-auto
(pedal-powered) and auto-rickshaws (motorized). Monitoring the movement of
auto-rickshaws is necessary, as traffic rules often restrict them from
accessing certain routes. However, existing surveillance systems make it
difficult to monitor them due to their similarity to other vehicles,
especially non-auto rickshaws, while manual video analysis is too
time-consuming. This
paper presents a machine learning-based approach to automatically detect
auto-rickshaws in traffic images. In this system, we used real-time object
detection using the YOLOv8 model. For training purposes, we prepared a set of
1,730 annotated images that were captured under various traffic conditions. The
results show that our proposed model performs well in real-time auto-rickshaw
detection and offers an mAP50 of 83.447% and binary precision and recall values
above 78%, demonstrating its effectiveness in handling both dense and sparse
traffic scenarios. The dataset has been publicly released for further research.
comment: 16 pages
☆ MV-MLM: Bridging Multi-View Mammography and Language for Breast Cancer Diagnosis and Risk Prediction ICCV 2025
Large annotated datasets are essential for training robust Computer-Aided
Diagnosis (CAD) models for breast cancer detection or risk prediction. However,
acquiring such datasets with fine-detailed annotation is both costly and
time-consuming. Vision-Language Models (VLMs), such as CLIP, which are
pre-trained on large image-text pairs, offer a promising solution by enhancing
robustness and data efficiency in medical imaging tasks. This paper introduces
a novel Multi-View Mammography and Language Model for breast cancer
classification and risk prediction, trained on a dataset of paired mammogram
images and synthetic radiology reports. Our MV-MLM leverages multi-view
supervision to learn rich representations from extensive radiology data by
employing cross-modal self-supervision across image-text pairs. This includes
multiple views and the corresponding pseudo-radiology reports. We propose a
novel joint visual-textual learning strategy to enhance generalization and
accuracy across different data types and tasks, distinguishing breast tissues
and cancer characteristics (calcification, mass) and utilizing these
patterns to understand mammography images and predict cancer risk. We evaluated
our method on both private and publicly available datasets, demonstrating that
the proposed model achieves state-of-the-art performance in three
classification tasks: (1) malignancy classification, (2) subtype
classification, and (3) image-based cancer risk prediction. Furthermore, the
model exhibits strong data efficiency, outperforming existing fully supervised
or VLM baselines while trained on synthetic text reports and without the need
for actual radiology reports.
comment: Accepted to Computer Vision for Automated Medical Diagnosis (CVAMD)
Workshop at ICCV 2025
☆ BasicAVSR: Arbitrary-Scale Video Super-Resolution via Image Priors and Enhanced Motion Compensation
Arbitrary-scale video super-resolution (AVSR) aims to enhance the resolution
of video frames, potentially at various scaling factors, which presents several
challenges regarding spatial detail reproduction, temporal consistency, and
computational complexity. In this paper, we propose a strong baseline BasicAVSR
for AVSR by integrating four key components: 1) adaptive multi-scale frequency
priors generated from image Laplacian pyramids, 2) a flow-guided propagation
unit to aggregate spatiotemporal information from adjacent frames, 3) a
second-order motion compensation unit for more accurate spatial alignment of
adjacent frames, and 4) a hyper-upsampling unit to generate scale-aware and
content-independent upsampling kernels. To meet diverse application demands, we
instantiate three propagation variants: (i) a unidirectional RNN unit for
strictly online inference, (ii) a unidirectional RNN unit empowered with a
limited lookahead that tolerates a small output delay, and (iii) a
bidirectional RNN unit designed for offline tasks where computational resources
are less constrained. Experimental results demonstrate the effectiveness and
adaptability of our model across these different scenarios. Through extensive
experiments, we show that BasicAVSR significantly outperforms existing methods
in terms of super-resolution quality, generalization ability, and inference
speed. Our work not only advances the state-of-the-art in AVSR but also extends
its core components to multiple frameworks for diverse scenarios. The code is
available at https://github.com/shangwei5/BasicAVSR.
comment: 13 pages, 10 figures, 5 tables
☆ StructLayoutFormer: Conditional Structured Layout Generation via Structure Serialization and Disentanglement
Structured layouts are preferable in many 2D visual contents (e.g., GUIs,
webpages) since the structural information allows convenient layout editing.
Computational frameworks can help create structured layouts but require heavy
labor input. Existing data-driven approaches are effective in automatically
generating fixed layouts but fail to produce layout structures. We present
StructLayoutFormer, a novel Transformer-based approach for conditional
structured layout generation. We use a structure serialization scheme to
represent structured layouts as sequences. To better control the structures of
generated layouts, we disentangle the structural information from the element
placements. Our approach is the first data-driven approach that achieves
conditional structured layout generation and produces realistic layout
structures explicitly. We compare our approach with existing data-driven layout
generation approaches by including post-processing for structure extraction.
Extensive experiments have shown that our approach exceeds these baselines in
conditional structured layout generation. We also demonstrate that our approach
is effective in extracting and transferring layout structures. The code is
publicly available at https://github.com/Teagrus/StructLayoutFormer.
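The structure-serialization idea, flattening a layout tree into a token sequence while keeping structure tokens separable from element placements, can be sketched as follows. The token vocabulary ('<', '>', type names) is illustrative, not the paper's actual scheme:

```python
def serialize(node, tokens=None):
    """Depth-first serialization of a layout tree into a flat token sequence.

    Structure tokens (element type, '<', '>') stay separable from placement
    tokens (x, y, w, h), mirroring the structure/placement disentanglement.
    """
    if tokens is None:
        tokens = []
    tokens.append(node["type"])
    if "box" in node:
        tokens.extend(node["box"])          # placement tokens: x, y, w, h
    if node.get("children"):
        tokens.append("<")
        for child in node["children"]:
            serialize(child, tokens)
        tokens.append(">")
    return tokens

def structure_only(tokens):
    """Disentangled view: keep the structure tokens, drop the placements."""
    return [t for t in tokens if isinstance(t, str)]

layout = {"type": "page", "children": [
    {"type": "header", "box": [0, 0, 100, 10]},
    {"type": "body", "box": [0, 10, 100, 80]},
]}
seq = serialize(layout)
```

Keeping the two token streams separable is what lets a sequence model condition on a target structure while freely generating the placements, or vice versa.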
☆ FullPart: Generating each 3D Part at Full Resolution
Lihe Ding, Shaocong Dong, Yaokun Li, Chenjian Gao, Xiao Chen, Rui Han, Yihao Kuang, Hong Zhang, Bo Huang, Zhanpeng Huang, Zibin Wang, Dan Xu, Tianfan Xue
Part-based 3D generation holds great potential for various applications.
Previous part generators that represent parts using implicit vector-set tokens
often suffer from insufficient geometric details. Another line of work adopts
an explicit voxel representation but shares a global voxel grid among all
parts; this often causes small parts to occupy too few voxels, leading to
degraded quality. In this paper, we propose FullPart, a novel framework that
combines both implicit and explicit paradigms. It first derives the bounding
box layout through an implicit box vector-set diffusion process, a task that
implicit diffusion handles effectively since box tokens contain little
geometric detail. Then, it generates detailed parts, each within its own fixed
full-resolution voxel grid. Instead of sharing a global low-resolution space,
each part in our method, even small ones, is generated at full resolution,
enabling the synthesis of intricate details. We further introduce a
center-point encoding strategy to address the misalignment issue when
exchanging information between parts of different actual sizes, thereby
maintaining global coherence. Moreover, to tackle the scarcity of reliable part
data, we present PartVerse-XL, the largest human-annotated 3D part dataset to
date with 40K objects and 320K parts. Extensive experiments demonstrate that
FullPart achieves state-of-the-art results in 3D part generation. We will
release all code, data, and model to benefit future research in 3D part
generation.
comment: Project page: https://fullpart3d.github.io
☆ Exploring Object-Aware Attention Guided Frame Association for RGB-D SLAM
Ali Caglayan, Nevrez Imamoglu, Oguzhan Guclu, Ali Osman Serhatoglu, Ahmet Burak Can, Ryosuke Nakamura
Attention models have recently emerged as a powerful approach, demonstrating
significant progress in various fields. Visualization techniques, such as class
activation mapping, provide visual insights into the reasoning of convolutional
neural networks (CNNs). Using network gradients, it is possible to identify
regions where the network pays attention during image recognition tasks.
Furthermore, these gradients can be combined with CNN features to localize more
generalizable, task-specific attentive (salient) regions within scenes.
However, explicit use of this gradient-based attention information integrated
directly into CNN representations for semantic object understanding remains
limited. Such integration is particularly beneficial for visual tasks like
simultaneous localization and mapping (SLAM), where CNN representations
enriched with spatially attentive object locations can enhance performance. In
this work, we propose utilizing task-specific network attention for RGB-D
indoor SLAM. Specifically, we integrate layer-wise attention information
derived from network gradients with CNN feature representations to improve
frame association performance. Experimental results indicate improved
performance compared to baseline methods, particularly for large environments.
comment: double-column 5 pages, 3 figures
☆ WOD-E2E: Waymo Open Dataset for End-to-End Driving in Challenging Long-tail Scenarios
Runsheng Xu, Hubert Lin, Wonseok Jeon, Hao Feng, Yuliang Zou, Liting Sun, John Gorman, Kate Tolstaya, Sarah Tang, Brandyn White, Ben Sapp, Mingxing Tan, Jyh-Jing Hwang, Drago Anguelov
Vision-based end-to-end (E2E) driving has garnered significant interest in
the research community due to its scalability and synergy with multimodal large
language models (MLLMs). However, current E2E driving benchmarks primarily
feature nominal scenarios, failing to adequately test the true potential of
these systems. Furthermore, existing open-loop evaluation metrics often fall
short in capturing the multi-modal nature of driving or effectively evaluating
performance in long-tail scenarios. To address these gaps, we introduce the
Waymo Open Dataset for End-to-End Driving (WOD-E2E). WOD-E2E contains 4,021
driving segments (approximately 12 hours), specifically curated for challenging
long-tail scenarios that are rare in daily life, occurring with a frequency of
less than 0.03%. Concretely, each segment in WOD-E2E includes the
high-level routing information, ego states, and 360-degree camera views from 8
surrounding cameras. To evaluate the E2E driving performance on these long-tail
situations, we propose a novel open-loop evaluation metric: Rater Feedback
Score (RFS). Unlike conventional metrics that measure the distance between
predicted waypoints and the logs, RFS measures how closely the predicted
trajectory matches rater-annotated trajectory preference labels. We have
released rater preference labels for all WOD-E2E validation set segments, while
the held-out test set labels have been used for the 2025 WOD-E2E Challenge.
Through our work, we aim to foster state-of-the-art research into
generalizable, robust, and safe end-to-end autonomous driving agents capable of
handling complex real-world situations.
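The abstract does not give the exact RFS formula, but the idea of scoring a prediction against rater-annotated preference labels rather than a single log can be illustrated with a toy stand-in (all names here are hypothetical):

```python
# Toy illustration of rater-feedback style scoring: credit a predicted
# trajectory with the preference label of the nearest rater-annotated
# trajectory, instead of measuring distance to a single driving log.
# This is NOT the actual RFS definition, only an illustrative sketch.
def rfs_toy(pred, rated):
    # pred: list of (x, y) waypoints; rated: list of (trajectory, preference)
    def dist(a, b):
        # mean Euclidean distance between corresponding waypoints
        return sum(((ax - bx) ** 2 + (ay - by) ** 2) ** 0.5
                   for (ax, ay), (bx, by) in zip(a, b)) / len(a)
    return min(rated, key=lambda tp: dist(pred, tp[0]))[1]

rated = [([(0.0, 0.0), (1.0, 0.0)], 10.0),   # raters prefer going straight
         ([(0.0, 0.0), (0.0, 1.0)], 2.0)]    # a dispreferred swerve
score = rfs_toy([(0.0, 0.0), (0.9, 0.1)], rated)
```

A multi-modal metric of this shape does not penalize a plan merely for differing from the log, as long as it lands near some trajectory raters considered good.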
☆ JOGS: Joint Optimization of Pose Estimation and 3D Gaussian Splatting
Traditional novel view synthesis methods heavily rely on external camera pose
estimation tools such as COLMAP, which often introduce computational
bottlenecks and propagate errors. To address these challenges, we propose a
unified framework that jointly optimizes 3D Gaussian points and camera poses
without requiring pre-calibrated inputs. Our approach iteratively refines 3D
Gaussian parameters and updates camera poses through a novel co-optimization
strategy, ensuring simultaneous improvements in scene reconstruction fidelity
and pose accuracy. The key innovation lies in decoupling the joint optimization
into two interleaved phases: first, updating 3D Gaussian parameters via
differentiable rendering with fixed poses, and second, refining camera poses
using a customized 3D optical flow algorithm that incorporates geometric and
photometric constraints. This formulation progressively reduces projection
errors, particularly in challenging scenarios with large viewpoint variations
and sparse feature distributions, where traditional methods struggle. Extensive
evaluations on multiple datasets demonstrate that our approach significantly
outperforms existing COLMAP-free techniques in reconstruction quality, and also
surpasses the standard COLMAP-based baseline in general.
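The interleaved two-phase scheme can be sketched on a toy problem (illustrative only, not the paper's implementation: "scene" stands in for 3D Gaussian parameters, "pose" for a camera pose, coupled through a toy bilinear observation model):

```python
# Alternating co-optimization sketch: phase 1 updates scene parameters
# with the pose fixed (cf. differentiable rendering); phase 2 refines
# the pose with the scene fixed (cf. the paper's 3D optical flow step).
def residuals(scene, pose, obs):
    # toy observation model: obs_i ~ scene_i * pose
    return [s * pose - o for s, o in zip(scene, obs)]

def loss(scene, pose, obs):
    return sum(r * r for r in residuals(scene, pose, obs))

def joint_fit(obs, steps=100, lr=0.05):
    scene = [0.5] * len(obs)  # uncalibrated cold-start
    pose = 0.5
    for _ in range(steps):
        # Phase 1: gradient step on scene, pose held fixed
        res = residuals(scene, pose, obs)
        scene = [s - lr * 2 * r * pose for s, r in zip(scene, res)]
        # Phase 2: gradient step on pose, scene held fixed
        res = residuals(scene, pose, obs)
        pose -= lr * 2 * sum(r * s for r, s in zip(res, scene))
    return scene, pose

obs = [2.0, 4.0, 6.0]
scene, pose = joint_fit(obs)
```

Because each phase only ever lowers the shared objective, projection error decreases monotonically up to step-size effects, which is the intuition behind the joint refinement.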
☆ OracleAgent: A Multimodal Reasoning Agent for Oracle Bone Script Research
Caoshuo Li, Zengmao Ding, Xiaobin Hu, Bang Li, Donghao Luo, Xu Peng, Taisong Jin, Yongge Liu, Shengwei Han, Jing Yang, Xiaoping He, Feng Gao, AndyPian Wu, SevenShu, Chaoyang Wang, Chengjie Wang
As one of the earliest writing systems, Oracle Bone Script (OBS) preserves
the cultural and intellectual heritage of ancient civilizations. However,
current OBS research faces two major challenges: (1) the interpretation of OBS
involves a complex workflow comprising multiple serial and parallel sub-tasks,
and (2) the efficiency of OBS information organization and retrieval remains a
critical bottleneck, as scholars often spend substantial effort searching for,
compiling, and managing relevant resources. To address these challenges, we
present OracleAgent, the first agent system designed for the structured
management and retrieval of OBS-related information. OracleAgent seamlessly
integrates multiple OBS analysis tools, empowered by large language models
(LLMs), and can flexibly orchestrate these components. Additionally, we
construct a comprehensive domain-specific multimodal knowledge base for OBS,
which is built through a rigorous multi-year process of data collection,
cleaning, and expert annotation. The knowledge base comprises over 1.4M
single-character rubbing images and 80K interpretation texts. OracleAgent
leverages this resource through its multimodal tools to assist experts in
retrieval of characters, documents, interpretation texts, and rubbing images.
Extensive experiments demonstrate that OracleAgent achieves superior
performance across a range of multimodal reasoning and generation tasks,
surpassing leading mainstream multimodal large language models (MLLMs) (e.g.,
GPT-4o). Furthermore, our case study illustrates that OracleAgent can
effectively assist domain experts, significantly reducing the time cost of OBS
research. These results highlight OracleAgent as a significant step toward the
practical deployment of OBS-assisted research and automated interpretation
systems.
☆ EgoExo-Con: Exploring View-Invariant Video Temporal Understanding
Can Video-LLMs achieve consistent temporal understanding when videos capture
the same event from different viewpoints? To study this, we introduce
EgoExo-Con (Consistency), a benchmark of comprehensively synchronized
egocentric and exocentric video pairs with human-refined queries in natural
language. EgoExo-Con emphasizes two temporal understanding tasks: Temporal
Verification and Temporal Grounding. It evaluates not only correctness but
consistency across viewpoints. Our analysis reveals two critical limitations of
existing Video-LLMs: (1) models often fail to maintain consistency, with
results far worse than their single-view performance. (2) When naively
fine-tuned with synchronized videos of both viewpoints, the models show improved
consistency but often underperform those trained on a single view. For
improvements, we propose View-GRPO, a novel reinforcement learning framework
that effectively strengthens view-specific temporal reasoning while encouraging
consistent comprehension across viewpoints. Our method demonstrates its
superiority over naive SFT and GRPO, especially for improving cross-view
consistency. All resources will be made publicly available.
comment: project page:
\url{https://minjoong507.github.io/projects/EgoExo-Con/}
☆ Security Risk of Misalignment between Text and Image in Multi-modal Model
Despite the notable advancements and versatility of multi-modal diffusion
models, such as text-to-image models, their susceptibility to adversarial
inputs remains underexplored. Contrary to expectations, our investigations
reveal that the alignment between textual and image modalities in existing
diffusion models is inadequate. This misalignment presents significant risks,
especially in the generation of inappropriate or Not-Safe-For-Work (NSFW)
content. To this end, we propose a novel attack called Prompt-Restricted
Multi-modal Attack (PReMA) to manipulate the generated content by modifying the
input image in conjunction with any specified prompt, without altering the
prompt itself. PReMA is the first attack that manipulates model outputs by
solely creating adversarial images, distinguishing itself from prior methods
that primarily generate adversarial prompts to produce NSFW content.
Consequently, PReMA poses a novel threat to the integrity of multi-modal
diffusion models, particularly in image-editing applications that operate with
fixed prompts. Comprehensive evaluations conducted on image inpainting and
style transfer tasks across various models confirm the potent efficacy of
PReMA.
☆ Dynamic VLM-Guided Negative Prompting for Diffusion Models NeurIPS 2025
We propose a novel approach for dynamic negative prompting in diffusion
models that leverages Vision-Language Models (VLMs) to adaptively generate
negative prompts during the denoising process. Unlike traditional Negative
Prompting methods that use fixed negative prompts, our method generates
intermediate image predictions at specific denoising steps and queries a VLM to
produce contextually appropriate negative prompts. We evaluate our approach on
various benchmark datasets and demonstrate the trade-offs between negative
guidance strength and text-image alignment.
comment: 39th Conference on Neural Information Processing Systems (NeurIPS
2025) Workshop: The First Workshop on Generative and Protective AI for
Content Creation
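The control flow described above (intermediate predictions at chosen steps, a VLM query, then guided denoising) can be sketched as follows; the VLM call, text encoder, and sampler update are all stubs, and every name is hypothetical rather than the paper's API:

```python
# Hedged sketch of dynamic VLM-guided negative prompting.
def stub_vlm_negative_prompt(intermediate_image):
    # A real system would decode the intermediate prediction, show it
    # to a VLM, and ask which artifacts to suppress; we return a fixed
    # string so the sketch is self-contained.
    return "blurry, distorted hands"

def embed(prompt):
    # stand-in for a text encoder: map a prompt to a pseudo-embedding
    return sum(ord(c) for c in prompt) % 997 / 997.0

def denoise(x, steps=10, query_steps=(3, 6), guidance=7.5):
    pos = embed("a photo of a cat")
    neg = embed("")  # start from an empty (fixed) negative prompt
    queried_at = []
    for t in range(steps):
        if t in query_steps:
            # predict an intermediate image, then refresh the negative prompt
            neg = embed(stub_vlm_negative_prompt(x))
            queried_at.append(t)
        # classifier-free-guidance-style update: push away from the negative
        eps = pos + guidance * (pos - neg)
        x = x - 0.1 * eps  # toy update standing in for one sampler step
    return x, queried_at

x, queried_at = denoise(0.0)
```

The key contrast with static negative prompting is that `neg` changes mid-trajectory based on what the partially denoised image actually looks like.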
☆ FlexICL: A Flexible Visual In-context Learning Framework for Elbow and Wrist Ultrasound Segmentation
Yuyue Zhou, Jessica Knight, Shrimanti Ghosh, Banafshe Felfeliyan, Jacob L. Jaremko, Abhilash R. Hareendranathan
Elbow and wrist fractures are the most common fractures in pediatric
populations. Automatic segmentation of musculoskeletal structures in ultrasound
(US) can improve diagnostic accuracy and treatment planning. Fractures appear
as cortical defects but require expert interpretation. Deep learning (DL) can
provide real-time feedback and highlight key structures, helping lightly
trained users perform exams more confidently. However, pixel-wise expert
annotations for training remain time-consuming and costly. To address this
challenge, we propose FlexICL, a novel and flexible in-context learning (ICL)
framework for segmenting bony regions in US images. We apply it to an
intra-video segmentation setting, where experts annotate only a small subset of
frames, and the model segments unseen frames. We systematically investigate
various image concatenation techniques and training strategies for visual ICL
and introduce novel concatenation methods that significantly enhance model
performance with limited labeled data. By integrating multiple augmentation
strategies, FlexICL achieves robust segmentation performance across four wrist
and elbow US datasets while requiring only 5% of the training images. It
outperforms state-of-the-art visual ICL models like Painter, MAE-VQGAN, and
conventional segmentation models like U-Net and TransUNet by 1-27% in Dice
coefficient on 1,252 US sweeps. These initial results highlight the potential
of FlexICL as an efficient and scalable solution for US image segmentation well
suited for medical imaging use cases where labeled data is scarce.
☆ Do Students Debias Like Teachers? On the Distillability of Bias Mitigation Methods
Knowledge distillation (KD) is an effective method for model compression and
transferring knowledge between models. However, its effect on a model's
robustness against spurious correlations that degrade performance on
out-of-distribution data remains underexplored. This study investigates the
effect of knowledge distillation on the transferability of ``debiasing''
capabilities from teacher models to student models on natural language
inference (NLI) and image classification tasks. Through extensive experiments,
we illustrate several key findings: (i) overall the debiasing capability of a
model is undermined post-KD; (ii) training a debiased model does not benefit
from injecting teacher knowledge; (iii) although the overall robustness of a
model may remain stable post-distillation, significant variations can occur
across different types of biases; and (iv) we pinpoint the internal attention
pattern and circuit that cause the distinct behavior post-KD. Given the above
findings, we propose three effective solutions to improve the distillability of
debiasing methods: developing high-quality data for augmentation, implementing
iterative knowledge distillation, and initializing student models with weights
obtained from teacher models. To the best of our knowledge, this is the first
study at scale of the effect of KD on debiasing and its internal mechanism. Our
findings provide insight into how KD works and how to design better
debiasing methods.
♻ ☆ Smoothing Slot Attention Iterations and Recurrences
Slot Attention (SA) and its variants lie at the heart of mainstream
Object-Centric Learning (OCL). Objects in an image can be aggregated into
respective slot vectors, by \textit{iteratively} refining cold-start query
vectors, typically three times, via SA on image features. For video, such
aggregation is \textit{recurrently} shared across frames, with queries
cold-started on the first frame while transitioned from the previous frame's
slots on non-first frames. However, the cold-start queries lack sample-specific
cues and thus hinder precise aggregation on the image or video's first frame;
also, non-first frames' queries are already sample-specific and thus require
transforms different from the first frame's aggregation. We address these
issues for the
first time with our \textit{SmoothSA}: (1) To smooth SA iterations on the image
or video's first frame, we \textit{preheat} the cold-start queries with rich
information of input features, via a tiny module self-distilled inside OCL; (2)
To smooth SA recurrences across all video frames, we \textit{differentiate} the
homogeneous transforms on the first and non-first frames, by using full and
single iterations respectively. Comprehensive experiments on object discovery,
recognition and downstream benchmarks validate our method's effectiveness.
Further analyses intuitively illuminate how our method smooths SA iterations
and recurrences. Our source code, model checkpoints and training logs are
available on https://github.com/Genera1Z/SmoothSA.
♻ ☆ Predicting Video Slot Attention Queries from Random Slot-Feature Pairs
Unsupervised video Object-Centric Learning (OCL) is promising as it enables
object-level scene representation and dynamics modeling as we humans do.
Mainstream video OCL methods adopt a recurrent architecture: an aggregator
aggregates the current video frame into object features, termed slots, under
some queries; a transitioner transitions the current slots into queries for the
next frame.
This is an effective architecture but all existing implementations both
(\textit{i1}) neglect to incorporate next frame features, the most informative
source for query prediction, and (\textit{i2}) fail to learn transition
dynamics, the knowledge essential for query prediction. To address these
issues, we propose Random Slot-Feature pair for learning Query prediction
(RandSF.Q): (\textit{t1}) We design a new transitioner to incorporate both
slots and features, which provides more information for query prediction;
(\textit{t2}) We train the transitioner to predict queries from slot-feature
pairs randomly sampled from available recurrences, which drives it to learn
transition dynamics. Experiments on scene representation demonstrate that our
method surpasses existing video OCL methods significantly, e.g., by up to 10 points
on object discovery, setting new state-of-the-art. Such superiority also
benefits downstream tasks like dynamics modeling. Our core source code, model
checkpoints and training logs are available on
https://github.com/Genera1Z/RandSF.Q.
♻ ☆ Locality in Image Diffusion Models Emerges from Data Statistics
Recent work has shown that the generalization ability of image diffusion
models arises from the locality properties of the trained neural network. In
particular, when denoising a particular pixel, the model relies on a limited
neighborhood of the input image around that pixel, which, according to the
previous work, is tightly related to the ability of these models to produce
novel images. Since locality is central to generalization, it is crucial to
understand why diffusion models learn local behavior in the first place, as
well as the factors that govern the properties of locality patterns. In this
work, we present evidence that the locality in deep diffusion models emerges as
a statistical property of the image dataset and is not due to the inductive
bias of convolutional neural networks, as suggested in previous work.
Specifically, we demonstrate that an optimal parametric linear denoiser
exhibits similar locality properties to deep neural denoisers. We show, both
theoretically and experimentally, that this locality arises directly from pixel
correlations present in the image datasets. Moreover, locality patterns are
drastically different on specialized datasets, approximating principal
components of the data's covariance. We use these insights to craft an
analytical denoiser that better matches scores predicted by a deep diffusion
model than prior expert-crafted alternatives. Our key takeaway is that while
neural network architectures influence generation quality, their primary role
is to capture locality patterns inherent in the data.
comment: 31 pages, 20 figures, 7 tables
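The "optimal parametric linear denoiser" claim has a concrete classical instance: for Gaussian noise of variance sigma2 and signal covariance C, the optimal linear denoiser is the Wiener filter W = C (C + sigma2 I)^{-1}. The sketch below (an assumption-laden toy, not the paper's experiment) uses an AR(1)-style pixel covariance C[i, j] = rho^|i-j| as a stand-in for natural-image correlations and shows that W's rows are already local filters:

```python
import numpy as np

# Optimal linear (Wiener) denoiser computed from an assumed AR(1)-style
# pixel covariance. Locality emerges from the data statistics alone:
# no convolutional architecture is involved.
n, rho, sigma2 = 32, 0.9, 0.25
idx = np.arange(n)
C = rho ** np.abs(idx[:, None] - idx[None, :])   # pixel correlations
W = C @ np.linalg.inv(C + sigma2 * np.eye(n))    # Wiener filter

row = W[n // 2]  # filter applied when denoising the center pixel
# The weight at the pixel itself dominates weights far away.
```

Because C is a function of pixel distance alone, each row of W concentrates its mass near the pixel being denoised, mirroring the locality patterns the paper observes in deep denoisers.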
♻ ☆ ScoreAdv: Score-based Targeted Generation of Natural Adversarial Examples via Diffusion Models
Despite the success of deep learning across various domains, it remains
vulnerable to adversarial attacks. Although many existing adversarial attack
methods achieve high success rates, they typically rely on $\ell_{p}$-norm
perturbation constraints, which do not align with human perceptual
capabilities. Consequently, researchers have shifted their focus toward
generating natural, unrestricted adversarial examples (UAEs). GAN-based
approaches suffer from inherent limitations, such as poor image quality due to
instability and mode collapse. Meanwhile, diffusion models have been employed
for UAE generation, but they still rely on iterative PGD perturbation
injection, without fully leveraging their central denoising capabilities. In
this paper, we introduce a novel approach for generating UAEs based on
diffusion models, named ScoreAdv. This method incorporates an interpretable
adversarial guidance mechanism to gradually shift the sampling distribution
towards the adversarial distribution, while using an interpretable saliency map
to inject the visual information of a reference image into the generated
samples. Notably, our method is capable of generating an unlimited number of
natural adversarial examples and can attack not only classification models but
also retrieval models. We conduct extensive experiments on ImageNet and CelebA
datasets, validating the performance of ScoreAdv across ten target models in
both black-box and white-box settings. Our results demonstrate that ScoreAdv
achieves state-of-the-art attack success rates and image quality, while
maintaining inference efficiency. Furthermore, the dynamic balance between
denoising and adversarial perturbation enables ScoreAdv to remain robust even
under defensive measures.
♻ ☆ GSE: Group-wise Sparse and Explainable Adversarial Attacks
Sparse adversarial attacks fool deep neural networks (DNNs) through minimal
pixel perturbations, often regularized by the $\ell_0$ norm. Recent efforts
have replaced this norm with a structural sparsity regularizer, such as the
nuclear group norm, to craft group-wise sparse adversarial attacks. The
resulting perturbations are thus explainable and hold significant practical
relevance, shedding light on an even greater vulnerability of DNNs. However,
crafting such attacks poses an optimization challenge, as it involves computing
norms for groups of pixels within a non-convex objective. We address this by
presenting a two-phase algorithm that generates group-wise sparse attacks
within semantically meaningful areas of an image. Initially, we optimize a
quasinorm adversarial loss using the $1/2$-quasinorm proximal operator tailored
for non-convex programming. Subsequently, the algorithm transitions to a
projected Nesterov's accelerated gradient descent with $2$-norm regularization
applied to perturbation magnitudes. Rigorous evaluations on CIFAR-10 and
ImageNet datasets demonstrate a remarkable increase in group-wise sparsity,
e.g., $50.9\%$ on CIFAR-10 and $38.4\%$ on ImageNet (average case, targeted
attack). This performance improvement is accompanied by significantly faster
computation times, improved explainability, and a $100\%$ attack success rate.
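The group-wise sparsification effect can be illustrated with the standard group-lasso proximal operator (group soft-thresholding). This is a simpler, well-known stand-in for the paper's $1/2$-quasinorm operator, not the paper's algorithm, but it shows the same qualitative behavior: whole pixel groups whose energy falls below a threshold are zeroed out at once.

```python
import math

# Prox of the group-lasso penalty lam * sum_g ||g||_2: shrink each
# group toward zero, and zero the entire group when its norm <= lam.
def group_soft_threshold(groups, lam):
    out = []
    for g in groups:
        norm = math.sqrt(sum(v * v for v in g))
        scale = max(0.0, 1.0 - lam / norm) if norm > 0 else 0.0
        out.append([scale * v for v in g])
    return out

# One strong group survives (shrunk); one weak group is zeroed entirely.
sparsified = group_soft_threshold([[3.0, 4.0], [0.1, 0.1]], lam=1.0)
```

Acting on groups rather than individual pixels is what makes the resulting perturbations concentrate in contiguous, semantically meaningful regions.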
♻ ☆ Resource Efficient Multi-stain Kidney Glomeruli Segmentation via Self-supervision
Semantic segmentation under domain shift remains a fundamental challenge in
computer vision, particularly when labelled training data is scarce. This
challenge is exemplified in histopathology image analysis, where
the same tissue structures must be segmented across images captured under
different imaging conditions (stains), each representing a distinct visual
domain. Traditional deep learning methods like UNet require extensive labels,
which is both costly and time-consuming, particularly when dealing with
multiple domains (or stains). To mitigate this, various unsupervised domain
adaptation based methods such as UDAGAN have been proposed, which reduce the
need for labels by requiring only one (source) stain to be labelled.
Nonetheless, obtaining source stain labels can still be challenging. This
article shows that through self-supervised pre-training -- including SimCLR,
BYOL, and a novel approach, HR-CS-CO -- the performance of these segmentation
methods (UNet, and UDAGAN) can be retained even with 95% fewer labels. Notably,
with self-supervised pre-training and using only 5% labels, the performance
drops are minimal: 5.9% for UNet and 6.2% for UDAGAN, averaged over all stains,
compared to their respective fully supervised counterparts (without
pre-training, using 100% labels). Furthermore, these findings are shown to
generalise beyond their training distribution to public benchmark datasets.
Implementations and pre-trained models are publicly available
\href{https://github.com/zeeshannisar/resource-effecient-multi-stain-kidney-glomeruli-segmentation.git}{online}.
comment: 39 pages, 10 figures, 4 Tables
♻ ☆ CronusVLA: Towards Efficient and Robust Manipulation via Multi-Frame Vision-Language-Action Modeling
Hao Li, Shuai Yang, Yilun Chen, Xinyi Chen, Xiaoda Yang, Yang Tian, Hanqing Wang, Tai Wang, Dahua Lin, Feng Zhao, Jiangmiao Pang
Recent vision-language-action (VLA) models built on pretrained
vision-language models (VLMs) have demonstrated strong performance in robotic
manipulation. However, these models remain constrained by the single-frame
image paradigm and fail to fully leverage the temporal information offered by
multi-frame histories, as directly feeding multiple frames into VLM backbones
incurs substantial computational overhead and inference latency. We propose
CronusVLA, a unified framework that extends single-frame VLA models to the
multi-frame paradigm. CronusVLA follows a two-stage process: (1) Single-frame
pretraining on large-scale embodied datasets with autoregressive prediction of
action tokens, establishing an effective embodied vision-language foundation;
(2) Multi-frame post-training, which adapts the prediction of the
vision-language backbone from discrete tokens to learnable features, and
aggregates historical information via feature chunking. CronusVLA effectively
addresses the existing challenges of multi-frame modeling while enhancing
performance and observational robustness. To evaluate the robustness under
temporal and spatial disturbances, we introduce SimplerEnv-OR, a novel
benchmark featuring 24 types of observational disturbances and 120 severity
levels. Experiments across three embodiments in simulated and real-world
environments demonstrate that CronusVLA achieves leading performance and
superior robustness, with a 70.9% success rate on SimplerEnv, a 26.8%
improvement over OpenVLA on LIBERO, and the highest robustness score on
SimplerEnv-OR. These results highlight the potential of efficient multi-frame
adaptation in VLA models for more powerful and robust real-world deployment.
comment: 39 pages, 24 figures
♻ ☆ Fit for Purpose? Deepfake Detection in the Real World
The rapid proliferation of AI-generated content, driven by advances in
generative adversarial networks, diffusion models, and multimodal large
language models, has made the creation and dissemination of synthetic media
effortless, heightening the risks of misinformation, particularly political
deepfakes that distort truth and undermine trust in political institutions. In
turn, governments, research institutions, and industry have strongly promoted
deepfake detection initiatives as solutions. Yet, most existing models are
trained and validated on synthetic, laboratory-controlled datasets, limiting
their generalizability to the kinds of real-world political deepfakes
circulating on social platforms that affect the public. In this work, we
introduce the first systematic benchmark based on the Political Deepfakes
Incident Database, a curated collection of real-world political deepfakes
shared on social media since 2018. Our study includes a systematic evaluation
of state-of-the-art deepfake detectors across academia, government, and
industry. We find that the detectors from academia and government perform
relatively poorly. While paid detection tools achieve relatively higher
performance than free-access models, all evaluated detectors struggle to
generalize effectively to authentic political deepfakes, and are vulnerable to
simple manipulations, especially in the video domain. These results underscore the need for
politically contextualized deepfake detection frameworks to better safeguard
the public in real-world settings.
♻ ☆ DDL: A Large-Scale Datasets for Deepfake Detection and Localization in Diversified Real-World Scenarios
Changtao Miao, Yi Zhang, Weize Gao, Zhiya Tan, Weiwei Feng, Man Luo, Jianshu Li, Ajian Liu, Yunfeng Diao, Qi Chu, Tao Gong, Zhe Li, Weibin Yao, Joey Tianyi Zhou
Recent advances in AIGC have exacerbated the misuse of malicious deepfake
content, making the development of reliable deepfake detection methods an
essential means to address this challenge. Although existing deepfake detection
models demonstrate outstanding performance in detection metrics, most methods
only provide simple binary classification results, lacking interpretability.
Recent studies have attempted to enhance the interpretability of classification
results by providing spatial manipulation masks or temporal forgery segments.
However, due to the limitations of forgery datasets, the practical
effectiveness of these methods remains suboptimal. The primary reason lies in
the fact that most existing deepfake datasets contain only binary labels, with
limited variety in forgery scenarios, insufficient diversity in deepfake types,
and relatively small data scales, making them inadequate for complex real-world
scenarios. To address this predicament, we construct a novel large-scale
deepfake detection and localization (\textbf{DDL}) dataset containing over
$\textbf{1.4M+}$ forged samples and encompassing up to $\textbf{80}$ distinct
deepfake methods. The DDL design incorporates four key innovations: (1)
\textbf{Comprehensive Deepfake Methods} (covering 7 different generation
architectures and a total of 80 methods), (2) \textbf{Varied Manipulation
Modes} (incorporating 7 classic and 3 novel forgery modes), (3) \textbf{Diverse
Forgery Scenarios and Modalities} (including 3 scenarios and 3 modalities), and
(4) \textbf{Fine-grained Forgery Annotations} (providing 1.18M+ precise spatial
masks and 0.23M+ precise temporal segments). Through these improvements, our DDL
not only provides a more challenging benchmark for complex real-world forgeries
but also offers crucial support for building next-generation deepfake
detection, localization, and interpretability methods.
comment: This paper is a preliminary version, with an extended and
comprehensive version currently under development
♻ ☆ HM-Talker: Hybrid Motion Modeling for High-Fidelity Talking Head Synthesis
Audio-driven talking head video generation enhances user engagement in
human-computer interaction. However, current methods frequently produce videos
with motion blur and lip jitter, primarily due to their reliance on implicit
modeling of audio-facial motion correlations--an approach lacking explicit
articulatory priors (i.e., anatomical guidance for speech-related facial
movements). To overcome this limitation, we propose HM-Talker, a novel
framework for generating high-fidelity, temporally coherent talking heads.
HM-Talker leverages a hybrid motion representation combining both implicit and
explicit motion cues. Explicit cues use Action Units (AUs), anatomically
defined facial muscle movements, alongside implicit features to minimize
phoneme-viseme misalignment. Specifically, our Cross-Modal Disentanglement
Module (CMDM) extracts complementary implicit/explicit motion features while
predicting AUs directly from audio input aligned to visual cues. To mitigate
identity-dependent biases in explicit features and enhance cross-subject
generalization, we introduce the Hybrid Motion Modeling Module (HMMM). This
module dynamically merges randomly paired implicit/explicit features, enforcing
identity-agnostic learning. Together, these components enable robust lip
synchronization across diverse identities, advancing personalized talking head
synthesis. Extensive experiments demonstrate HM-Talker's superiority over
state-of-the-art methods in visual quality and lip-sync accuracy.
♻ ☆ MaskCaptioner: Learning to Jointly Segment and Caption Object Trajectories in Videos
Dense Video Object Captioning (DVOC) is the task of jointly detecting,
tracking, and captioning object trajectories in a video, requiring the ability
to understand spatio-temporal details and describe them in natural language.
Due to the complexity of the task and the high cost associated with manual
annotation, previous approaches resort to disjoint training strategies,
potentially leading to suboptimal performance. To circumvent this issue, we
propose to generate captions about spatio-temporally localized entities
leveraging a state-of-the-art VLM. By extending the LVIS and LV-VIS datasets
with our synthetic captions (LVISCap and LV-VISCap), we train MaskCaptioner, an
end-to-end model capable of jointly detecting, segmenting, tracking and
captioning object trajectories. Moreover, with pretraining on LVISCap and
LV-VISCap, MaskCaptioner achieves state-of-the-art DVOC results on three
existing benchmarks, VidSTG, VLN and BenSMOT. The datasets and code are
available at https://www.gabriel.fiastre.fr/maskcaptioner/.
comment: 20 pages, 8 figures
♻ ☆ LinearSR: Unlocking Linear Attention for Stable and Efficient Image Super-Resolution
Generative models for Image Super-Resolution (SR) are increasingly powerful,
yet their reliance on self-attention's quadratic complexity (O(N^2)) creates a
major computational bottleneck. Linear Attention offers an O(N) solution, but
its promise for photorealistic SR has remained largely untapped, historically
hindered by a cascade of interrelated and previously unsolved challenges. This
paper introduces LinearSR, a holistic framework that, for the first time,
systematically overcomes these critical hurdles. Specifically, we resolve a
fundamental training instability that causes catastrophic model divergence
using our novel "knee point"-based Early-Stopping Guided Fine-tuning (ESGF)
strategy. Furthermore, we mitigate the classic perception-distortion trade-off
with a dedicated SNR-based Mixture of Experts (MoE) architecture. Finally, we
establish an effective and lightweight guidance paradigm, TAG, derived from our
"precision-over-volume" principle. Our resulting LinearSR model simultaneously
delivers state-of-the-art perceptual quality with exceptional efficiency. Its
core diffusion forward pass (1-NFE) achieves SOTA-level speed, while its
overall multi-step inference time remains highly competitive. This work
provides the first robust methodology for applying Linear Attention in the
photorealistic SR domain, establishing a foundational paradigm for future
research in efficient generative super-resolution.
comment: 19 pages, 9 figures, 6 tables
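The abstract does not spell out LinearSR's attention formulation; as a generic illustration of how linear attention reaches O(N), the standard kernelized form (with a hypothetical positive feature map `phi`, not necessarily the paper's choice) associates phi(K)^T V before multiplying by phi(Q):

```python
import numpy as np

def linear_attention(Q, K, V, phi=lambda x: np.maximum(x, 0.0) + 1e-6):
    """Kernelized attention in O(N * d^2) instead of softmax's O(N^2 * d).
    phi is a hypothetical positive feature map; LinearSR's actual choice
    is not given in the abstract."""
    q, k = phi(Q), phi(K)              # (N, d)
    kv = k.T @ V                       # (d, d_v): aggregate keys/values once
    z = q @ k.sum(axis=0)              # (N,): per-query normalizer
    return (q @ kv) / z[:, None]

def quadratic_attention(Q, K, V, phi=lambda x: np.maximum(x, 0.0) + 1e-6):
    """Explicit O(N^2) reference that the linear form matches exactly."""
    A = phi(Q) @ phi(K).T              # (N, N) similarity matrix
    return (A @ V) / A.sum(axis=1, keepdims=True)
```

Because matrix multiplication is associative, the two routines produce identical outputs; only the cost differs.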
♻ ☆ UV-Attack: Physical-World Adversarial Attacks for Person Detection via Dynamic-NeRF-based UV Mapping ICLR2025
In recent research, adversarial attacks on person detectors using patches or
static 3D model-based texture modifications have struggled with low success
rates due to the flexible nature of human movement. Modeling the 3D
deformations caused by various actions has been a major challenge. Fortunately,
advancements in Neural Radiance Fields (NeRF) for dynamic human modeling offer
new possibilities. In this paper, we introduce UV-Attack, a groundbreaking
approach that achieves high success rates even with extensive and unseen human
actions. We address the challenge above by leveraging dynamic-NeRF-based UV
mapping. UV-Attack can generate human images across diverse actions and
viewpoints, and even create novel actions by sampling from the SMPL parameter
space. While dynamic NeRF models are capable of modeling human bodies,
modifying clothing textures is challenging because they are embedded in neural
network parameters. To tackle this, UV-Attack generates UV maps instead of RGB
images and modifies the texture stacks. This approach enables real-time texture
edits and makes the attack more practical. We also propose a novel Expectation
over Pose Transformation loss (EoPT) to improve the evasion success rate on
unseen poses and views. Our experiments show that UV-Attack achieves a 92.7%
attack success rate against the FastRCNN model across varied poses in dynamic
video settings, significantly outperforming the state-of-the-art AdvCamou
attack, which achieves only a 28.5% ASR. Moreover, we achieve 49.5% ASR on the
latest YOLOv8 detector in black-box settings. This work highlights the
potential of dynamic NeRF-based UV mapping for creating more effective
adversarial attacks on person detectors, addressing key challenges in modeling
human movement and texture modification. The code is available at
https://github.com/PolyLiYJ/UV-Attack.
comment: 23 pages, 22 figures, accepted by ICLR2025
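The abstract gives only the name of the EoPT loss; a minimal Monte-Carlo sketch of an expectation-over-transformation objective (every callable here is a hypothetical stand-in for SMPL pose sampling, the dynamic-NeRF renderer, and the detector) would be:

```python
import random

def eopt_loss(texture, sample_pose, render, detect_conf, n_samples=8, rng=None):
    """Monte-Carlo estimate of E_pose[detector confidence] for an adversarial
    texture; minimizing this drives evasion across unseen poses and views.
    sample_pose/render/detect_conf are hypothetical stand-ins, not the
    paper's actual components."""
    rng = rng or random.Random(0)
    total = 0.0
    for _ in range(n_samples):
        pose = sample_pose(rng)            # draw a pose (e.g. from SMPL space)
        image = render(texture, pose)      # render the textured person
        total += detect_conf(image)        # detector confidence on the render
    return total / n_samples
```

In the attack loop the texture would be updated by gradient descent on this estimate, averaging out pose variation rather than overfitting to a single view.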
♻ ☆ A Continuous and Interpretable Morphometric for Robust Quantification of Dynamic Biological Shapes
Roua Rouatbi, Juan-Esteban Suarez Cardona, Alba Villaronga-Luque, Jesse V. Veenvliet, Ivo F. Sbalzarini
We introduce the Push-Forward Signed Distance Morphometric (PF-SDM) for shape
quantification in biomedical imaging. The PF-SDM compactly encodes geometric
and topological properties of closed shapes, including their skeleton and
symmetries. This provides robust and interpretable features for shape
comparison and machine learning. The PF-SDM is mathematically smooth, providing
access to gradients and differential-geometric quantities. It also extends to
temporal dynamics and allows fusing spatial intensity distributions, such as
genetic markers, with shape dynamics. We present the PF-SDM theory, benchmark
it on synthetic data, and apply it to predicting body-axis formation in mouse
gastruloids, outperforming a CNN baseline in both accuracy and speed.
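As background for the signed-distance building block (the abstract does not spell out the push-forward construction itself), the signed distance of points to a circle of radius r is simply ||p - c|| - r, negative inside the shape and positive outside:

```python
import numpy as np

def signed_distance_circle(points, center, radius):
    """Signed distance to a circle boundary: negative inside, zero on the
    boundary, positive outside. Illustrative only; the PF-SDM builds
    further geometric and topological structure on top of such fields."""
    d = np.linalg.norm(np.asarray(points, float) - np.asarray(center, float), axis=-1)
    return d - radius
```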
♻ ☆ VerifIoU - Robustness of Object Detection to Perturbations
Noémie Cohen, Mélanie Ducoffe, Ryma Boumazouza, Christophe Gabreau, Claire Pagetti, Xavier Pucel, Audrey Galametz
We introduce a novel Interval Bound Propagation (IBP) approach for the formal
verification of object detection models, specifically targeting the
Intersection over Union (IoU) metric. The approach has been implemented in an
open-source tool, named IBP IoU, compatible with popular abstract
interpretation based verification tools. The resulting verifier is evaluated on
landing approach runway detection and handwritten digit recognition case
studies. Comparisons against a baseline (Vanilla IBP IoU) highlight the
superior performance of IBP IoU in ensuring accuracy and stability,
contributing to more secure and robust machine learning applications.
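A minimal sketch of the underlying idea (my own simplification, not the IBP IoU implementation): propagate interval bounds through the IoU computation for a predicted box whose corners may each shift by up to ±eps, yielding a conservative certified range.

```python
def interval_iou(box_a, box_b, eps):
    """Conservative [lo, hi] bounds on IoU(box_a, box_b) when every
    coordinate of box_b may be perturbed by up to +/- eps.
    Boxes are (x1, y1, x2, y2). Simplified interval bound propagation,
    not the paper's tool."""
    iv = lambda c: (c - eps, c + eps)                 # interval for a perturbed coord
    i_min = lambda a, b: (min(a[0], b[0]), min(a[1], b[1]))
    i_max = lambda a, b: (max(a[0], b[0]), max(a[1], b[1]))
    i_sub = lambda a, b: (a[0] - b[1], a[1] - b[0])
    pos = lambda a: (max(0.0, a[0]), max(0.0, a[1]))  # clamp at zero
    i_mul = lambda a, b: (a[0] * b[0], a[1] * b[1])   # valid for non-negative intervals

    ax1, ay1, ax2, ay2 = [(c, c) for c in box_a]      # exact box as degenerate intervals
    bx1, by1, bx2, by2 = [iv(c) for c in box_b]

    w = pos(i_sub(i_min(ax2, bx2), i_max(ax1, bx1)))  # intersection width interval
    h = pos(i_sub(i_min(ay2, by2), i_max(ay1, by1)))
    inter = i_mul(w, h)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = i_mul(pos(i_sub(bx2, bx1)), pos(i_sub(by2, by1)))

    lo = inter[0] / max(area_a + area_b[1] - inter[0], 1e-12)
    hi = min(1.0, inter[1] / max(area_a + area_b[0] - inter[1], 1e-12))
    return lo, hi
```

With eps = 0 the bounds collapse to the exact IoU; for eps > 0 they bracket every IoU the perturbation could produce, which is the property a verifier needs.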
♻ ☆ ReCon-GS: Continuum-Preserved Gaussian Streaming for Fast and Compact Reconstruction of Dynamic Scenes NeurIPS 2025
Online free-viewpoint video (FVV) reconstruction is challenged by slow
per-frame optimization, inconsistent motion estimation, and unsustainable
storage demands. To address these challenges, we propose the Reconfigurable
Continuum Gaussian Stream, dubbed ReCon-GS, a novel storage-aware framework
that enables high fidelity online dynamic scene reconstruction and real-time
rendering. Specifically, we dynamically allocate multi-level Anchor Gaussians
in a density-adaptive fashion to capture inter-frame geometric deformations,
thereby decomposing scene motion into compact coarse-to-fine representations.
Then, we design a dynamic hierarchy reconfiguration strategy that preserves
localized motion expressiveness through on-demand anchor re-hierarchization,
while ensuring temporal consistency through intra-hierarchical deformation
inheritance that confines transformation priors to their respective hierarchy
levels. Furthermore, we introduce a storage-aware optimization mechanism that
flexibly adjusts the density of Anchor Gaussians at different hierarchy levels,
enabling a controllable trade-off between reconstruction fidelity and memory
usage. Extensive experiments on three widely used datasets demonstrate that,
compared to state-of-the-art methods, ReCon-GS improves training efficiency by
approximately 15% and achieves superior FVV synthesis quality with enhanced
robustness and stability. Moreover, at equivalent rendering quality, ReCon-GS
slashes memory requirements by over 50% compared to leading state-of-the-art
methods.
comment: Published in NeurIPS 2025
♻ ☆ StyleGuard: Preventing Text-to-Image-Model-based Style Mimicry Attacks by Style Perturbations NIPS2025
Recently, text-to-image diffusion models have been widely used for style
mimicry and personalized customization through methods such as DreamBooth and
Textual Inversion. This has raised concerns about intellectual property
protection and the generation of deceptive content. Recent studies, such as
Glaze and Anti-DreamBooth, have proposed using adversarial noise to protect
images from these attacks. However, recent purification-based methods, such as
DiffPure and Noise Upscaling, have successfully attacked these latest defenses,
showing the vulnerabilities of these methods. Moreover, present methods show
limited transferability across models, making them less effective against
unknown text-to-image models. To address these issues, we propose a novel
anti-mimicry method, StyleGuard. We propose a novel style loss that optimizes
the style-related features in the latent space so that they deviate from those
of the original image, which improves model-agnostic transferability.
Additionally, to
enhance the perturbation's ability to bypass diffusion-based purification, we
designed a novel upscale loss that involves ensemble purifiers and upscalers
during training. Extensive experiments on the WikiArt and CelebA datasets
demonstrate that StyleGuard outperforms existing methods in robustness against
various transformations and purifications, effectively countering style mimicry
in various models. Moreover, StyleGuard is effective on different style mimicry
methods, including DreamBooth and Textual Inversion. The code is available at
https://github.com/PolyLiYJ/StyleGuard.
comment: Accepted by NIPS2025
♻ ☆ LATex: Leveraging Attribute-based Text Knowledge for Aerial-Ground Person Re-Identification
As an important task in intelligent transportation systems, Aerial-Ground
person Re-IDentification (AG-ReID) aims to retrieve specific persons across
heterogeneous cameras in different viewpoints. Previous methods typically adopt
deep learning-based models, focusing on extracting view-invariant features.
However, they usually overlook the semantic information in person attributes.
In addition, existing training strategies often rely on fully fine-tuning
large-scale models, which significantly increases training costs. To address
these issues, we propose a novel framework named LATex for AG-ReID, which
adopts prompt-tuning strategies to leverage attribute-based text knowledge.
Specifically, with the Contrastive Language-Image Pre-training (CLIP) model, we
first propose an Attribute-aware Image Encoder (AIE) to extract both global
semantic features and attribute-aware features from input images. Then, with
these features, we propose a Prompted Attribute Classifier Group (PACG) to
predict person attributes and obtain attribute representations. Finally, we
design a Coupled Prompt Template (CPT) to transform attribute representations
and view information into structured sentences. These sentences are processed
by the text encoder of CLIP to generate more discriminative features. As a
result, our framework can fully leverage attribute-based text knowledge to
improve AG-ReID performance. Extensive experiments on three AG-ReID benchmarks
demonstrate the effectiveness of our proposed methods. The source code is
available at https://github.com/kevinhu314/LATex.
comment: More modifications may be performed
♻ ☆ EmoAttack: Emotion-to-Image Diffusion Models for Emotional Backdoor Generation
Text-to-image diffusion models can generate realistic images based on textual
inputs, enabling users to convey their opinions visually through language.
Meanwhile, within language, emotion plays a crucial role in expressing personal
opinions in our daily lives and the inclusion of maliciously negative content
can lead users astray, exacerbating negative emotions. Recognizing the success
of diffusion models and the significance of emotion, we investigate a
previously overlooked risk associated with text-to-image diffusion models, that
is, utilizing emotion in the input texts to introduce negative content and
provoke unfavorable emotions in users. Specifically, we identify a new backdoor
attack, i.e., emotion-aware backdoor attack (EmoAttack), which introduces
malicious negative content triggered by emotional texts during image
generation. We formulate such an attack as a diffusion personalization problem
to avoid extensive model retraining and propose EmoBooth. Unlike existing
personalization methods, our approach fine-tunes a pre-trained diffusion model
by establishing a mapping between a cluster of emotional words and a given
reference image containing malicious negative content. To validate our method,
we built a dataset and conducted extensive analysis and discussion of its
effectiveness. Given consumers' widespread
use of diffusion models, uncovering this threat is critical for society.
♻ ☆ Two Heads are Better than One: Robust Learning Meets Multi-branch Models
Zongyuan Zhang, Qingwen Bu, Tianyang Duan, Zheng Lin, Yuhao Qing, Zihan Fang, Heming Cui, Dong Huang
Deep neural networks (DNNs) are vulnerable to adversarial examples, in which
DNNs are misled to false outputs due to inputs containing imperceptible
perturbations. Adversarial training, a reliable and effective method of
defense, may significantly reduce the vulnerability of neural networks and has
become the de facto standard for robust learning. While many recent works
practice the data-centric philosophy, such as how to generate better
adversarial examples or use generative models to produce additional training
data, we look back to the models themselves and revisit the adversarial
robustness from the perspective of deep feature distribution as an insightful
complementarity. In this paper, we propose \textit{Branch Orthogonality
adveRsarial Training} (BORT) to obtain state-of-the-art performance with solely
the original dataset for adversarial training. To practice our design idea of
integrating multiple orthogonal solution spaces, we leverage a simple and
straightforward multi-branch neural network that resists adversarial attacks
with no increase in inference time. We heuristically propose a corresponding
loss function, branch-orthogonal loss, to make each solution space of the
multi-branch model orthogonal. We evaluate our approach on CIFAR-10, CIFAR-100
and SVHN against $\ell_{\infty}$ norm-bounded perturbations of size $\epsilon =
8/255$, respectively. Exhaustive experiments show that our method surpasses
all state-of-the-art methods without any tricks. Compared to
all methods that do not use additional data for training, our models achieve
67.3\% and 41.5\% robust accuracy on CIFAR-10 and CIFAR-100 (improving upon the
state-of-the-art by +7.23\% and +9.07\%). We also outperform methods using a
training set with a far larger scale than ours.
comment: Camera-ready version for ICPADS 2025
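A minimal numpy sketch of the idea behind a branch-orthogonal loss (the exact formulation is not given in the abstract, so this is an assumed instantiation): penalize the squared cosine similarity between the features of every pair of branches so that their solution spaces decorrelate.

```python
import numpy as np

def branch_orthogonal_loss(branch_feats, eps=1e-8):
    """branch_feats: list of (batch, dim) feature arrays, one per branch.
    Returns the mean squared pairwise cosine similarity: ~0 when all
    branches are mutually orthogonal, ~1 when they are identical.
    An assumed sketch, not the paper's exact loss."""
    normed = [f / (np.linalg.norm(f, axis=1, keepdims=True) + eps) for f in branch_feats]
    loss, pairs = 0.0, 0
    for i in range(len(normed)):
        for j in range(i + 1, len(normed)):
            cos = np.sum(normed[i] * normed[j], axis=1)  # per-sample cosine
            loss += float(np.mean(cos ** 2))
            pairs += 1
    return loss / max(pairs, 1)
```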
♻ ☆ Static for Dynamic: Towards a Deeper Understanding of Dynamic Facial Expressions Using Static Expression Data
Dynamic facial expression recognition (DFER) infers emotions from the
temporal evolution of expressions, unlike static facial expression recognition
(SFER), which relies solely on a single snapshot. This temporal analysis
provides richer information and promises greater recognition capability.
However, current DFER methods often exhibit unsatisfactory performance largely due
to fewer training samples compared to SFER. Given the inherent correlation
between static and dynamic expressions, we hypothesize that leveraging the
abundant SFER data can enhance DFER. To this end, we propose Static-for-Dynamic
(S4D), a unified dual-modal learning framework that integrates SFER data as a
complementary resource for DFER. Specifically, S4D employs dual-modal
self-supervised pre-training on facial images and videos using a shared Vision
Transformer (ViT) encoder-decoder architecture, yielding improved
spatiotemporal representations. The pre-trained encoder is then fine-tuned on
static and dynamic expression datasets in a multi-task learning setup to
facilitate emotional information interaction. Unfortunately, vanilla multi-task
learning in our study results in negative transfer. To address this, we propose
an innovative Mixture of Adapter Experts (MoAE) module that facilitates
task-specific knowledge acquisition while effectively extracting shared
knowledge from both static and dynamic expression data. Extensive experiments
demonstrate that S4D achieves a deeper understanding of DFER, setting new
state-of-the-art performance on FERV39K, MAFW, and DFEW benchmarks, with
weighted average recall (WAR) of 53.65\%, 58.44\%, and 76.68\%, respectively.
Additionally, a systematic correlation analysis between SFER and DFER tasks is
presented, which further elucidates the potential benefits of leveraging SFER.
comment: The code and model are publicly available here
https://github.com/MSA-LMC/S4D
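A toy numpy sketch in the spirit of the MoAE module (the dimensions, gating scheme, and residual form are all my assumptions; the abstract only names the module): a softmax gate mixes the outputs of several bottleneck adapters and adds them residually to the input features.

```python
import numpy as np

def moae_forward(x, adapters, gate_w):
    """x: (batch, dim) token features; adapters: list of (down, up) weight
    pairs with down (dim, r) and up (r, dim); gate_w: (dim, n_experts).
    A hypothetical mixture-of-adapter-experts layer: softmax-gated sum of
    bottleneck adapter outputs, added residually to x."""
    logits = x @ gate_w                                    # (batch, n_experts)
    g = np.exp(logits - logits.max(axis=1, keepdims=True))
    g /= g.sum(axis=1, keepdims=True)                      # softmax gate
    outs = np.stack([np.maximum(x @ d, 0.0) @ u for d, u in adapters], axis=1)
    return x + np.einsum('be,bed->bd', g, outs)            # residual mixture
```

The gate lets each sample route toward task-specific adapters while the shared backbone features pass through unchanged, which matches the stated goal of separating task-specific from shared knowledge.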
♻ ☆ Tunable-Generalization Diffusion Powered by Self-Supervised Contextual Sub-Data for Low-Dose CT Reconstruction
Current deep learning models for low-dose CT denoising rely heavily on paired
data and generalize poorly. Even the widely studied diffusion models need to
learn the distribution of clean data for reconstruction, a requirement that is
difficult to satisfy in clinical applications. At the same time,
self-supervised methods suffer significant degradation in generalizability
when a model pre-trained at one dose is extended to other doses. To address
these issues, this work proposes a novel method of
TUnable-geneRalizatioN Diffusion (TurnDiff) powered by self-supervised
contextual sub-data for low-dose CT reconstruction. Firstly, a contextual
subdata self-enhancing similarity strategy is designed for denoising centered
on the LDCT projection domain, which provides an initial prior for the
subsequent progress. Subsequently, the initial prior is used to combine
knowledge distillation with a deep combination of latent diffusion models for
optimizing image details. The pre-trained model is used for inference
reconstruction, and the pixel-level self-correcting fusion technique is
proposed for fine-grained reconstruction of the image domain to enhance the
image fidelity, using the initial prior and the LDCT image as a guide. In
addition, the technique flexibly generalizes to higher, lower, or even unseen
doses. By cascading dual-domain strategies for self-supervised LDCT denoising,
TurnDiff requires only LDCT projection-domain data for training and testing.
Comprehensive evaluation on both benchmark
datasets and real-world data demonstrates that TurnDiff consistently
outperforms state-of-the-art methods in both reconstruction and generalization.
♻ ☆ Seeing Structural Failure Before it Happens: An Image-Based Physics-Informed Neural Network (PINN) for Spaghetti Bridge Load Prediction
Physics Informed Neural Networks (PINNs) are gaining attention for their
ability to embed physical laws into deep learning models, which is particularly
useful in structural engineering tasks with limited data. This paper aims to
explore the use of PINNs to predict the weight of small scale spaghetti
bridges, a task relevant to understanding load limits and potential failure
modes in simplified structural models. Our proposed framework incorporates
physics-based constraints into the prediction model for improved performance. In
addition to standard PINNs, we introduce a novel architecture named Physics
Informed Kolmogorov Arnold Network (PIKAN), which blends universal function
approximation theory with physical insights. The structural parameters provided
as input to the model are collected either manually or through computer vision
methods. Our dataset includes 15 real bridges, augmented to 100 samples, and
our best model achieves an $R^2$ score of 0.9603 and a mean absolute error
(MAE) of 10.50 units. From an applied perspective, we also provide a web-based
interface for parameter entry and prediction. These results show that PINNs can
offer reliable estimates of structural weight, even with limited data, and may
help inform early stage failure analysis in lightweight bridge designs.
The complete data and code are available at
https://github.com/OmerJauhar/PINNS-For-Spaghetti-Bridges.
comment: 12 pages, 17 figures. Preprint
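The abstract does not state which physical law is embedded; as a generic sketch, a PINN-style objective adds a weighted physics-residual penalty (the residual function here is a hypothetical placeholder for whatever constraint the model should satisfy) to the ordinary data loss:

```python
import numpy as np

def pinn_loss(pred, target, physics_residual, lam=0.1):
    """Data-fit MSE plus a weighted penalty on a physics residual.
    physics_residual stands in for whatever constraint is enforced
    (e.g. a static-equilibrium relation); its exact form in the paper is
    not given here, so this is a generic composite objective."""
    pred, target = np.asarray(pred, float), np.asarray(target, float)
    data_term = np.mean((pred - target) ** 2)
    physics_term = np.mean(np.asarray(physics_residual, float) ** 2)
    return data_term + lam * physics_term
```

The weight lam trades off fitting the measured bridge weights against satisfying the embedded constraint, which is how PINNs compensate for small datasets.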
♻ ☆ SD-ReID: View-aware Stable Diffusion for Aerial-Ground Person Re-Identification
Aerial-Ground Person Re-IDentification (AG-ReID) aims to retrieve specific
persons across cameras with different viewpoints. Previous works focus on
designing discriminative models to maintain the identity consistency despite
drastic changes in camera viewpoints. The core idea behind these methods is
quite natural, but designing a view-robust model is a very challenging task.
Moreover, they overlook the contribution of view-specific features in enhancing
the model's ability to represent persons. To address these issues, we propose a
novel generative framework named SD-ReID for AG-ReID, which leverages
generative models to mimic the feature distribution of different views while
extracting robust identity representations. More specifically, we first train a
ViT-based model to extract person representations along with controllable
conditions, including identity and view conditions. We then fine-tune the
Stable Diffusion (SD) model to enhance person representations guided by these
controllable conditions. Furthermore, we introduce the View-Refined Decoder
(VRD) to bridge the gap between instance-level and global-level features.
Finally, both person representations and all-view features are employed to
retrieve target persons. Extensive experiments on five AG-ReID benchmarks
(i.e., CARGO, AG-ReIDv1, AG-ReIDv2, LAGPeR and G2APS-ReID) demonstrate the
effectiveness of our proposed method. The source code will be available.
comment: More modifications may be performed
♻ ☆ TRUST-VL: An Explainable News Assistant for General Multimodal Misinformation Detection EMNLP 2025
Multimodal misinformation, encompassing textual, visual, and cross-modal
distortions, poses an increasing societal threat that is amplified by
generative AI. Existing methods typically focus on a single type of distortion
and struggle to generalize to unseen scenarios. In this work, we observe that
different distortion types share common reasoning capabilities while also
requiring task-specific skills. We hypothesize that joint training across
distortion types facilitates knowledge sharing and enhances the model's ability
to generalize. To this end, we introduce TRUST-VL, a unified and explainable
vision-language model for general multimodal misinformation detection. TRUST-VL
incorporates a novel Question-Aware Visual Amplifier module, designed to
extract task-specific visual features. To support training, we also construct
TRUST-Instruct, a large-scale instruction dataset containing 198K samples
featuring structured reasoning chains aligned with human fact-checking
workflows. Extensive experiments on both in-domain and zero-shot benchmarks
demonstrate that TRUST-VL achieves state-of-the-art performance, while also
offering strong generalization and interpretability.
comment: EMNLP 2025 Oral; Project Homepage:
https://yanzehong.github.io/trust-vl/
♻ ☆ Paper2Poster: Towards Multimodal Poster Automation from Scientific Papers
Academic poster generation is a crucial yet challenging task in scientific
communication, requiring the compression of long-context interleaved documents
into a single, visually coherent page. To address this challenge, we introduce
the first benchmark and metric suite for poster generation, which pairs recent
conference papers with author-designed posters and evaluates outputs on
(i) Visual Quality: semantic alignment with human posters, (ii) Textual
Coherence: language fluency, (iii) Holistic Assessment: six fine-grained
aesthetic and informational criteria scored by a VLM-as-judge, and notably
(iv) PaperQuiz: the poster's ability to convey core paper content as measured
by VLMs answering generated quizzes. Building on this benchmark, we propose
PosterAgent, a top-down, visual-in-the-loop multi-agent pipeline: the (a) Parser
distills the paper into a structured asset library; the (b) Planner aligns
text-visual pairs into a binary-tree layout that preserves reading order and
spatial balance; and the (c) Painter-Commenter loop refines each panel by
executing rendering code and using VLM feedback to eliminate overflow and
ensure alignment. In our comprehensive evaluation, we find that GPT-4o
outputs, though visually appealing at first glance, often exhibit noisy text
and poor PaperQuiz scores, and we find that reader engagement is the primary
aesthetic bottleneck, as human-designed posters rely largely on visual
semantics to convey meaning. Our fully open-source variants (e.g. based on the
Qwen-2.5 series) outperform existing 4o-driven multi-agent systems across
nearly all metrics, while using 87% fewer tokens. It transforms a 22-page paper
into a finalized yet editable .pptx poster - all for just $0.005. These
findings chart clear directions for the next generation of fully automated
poster-generation models. The code and datasets are available at
https://github.com/Paper2Poster/Paper2Poster.
comment: Project Page: https://github.com/Paper2Poster/Paper2Poster
♻ ☆ MindGYM: What Matters in Question Synthesis for Thinking-Centric Fine-Tuning? NeurIPS'25
Large foundation models face challenges in acquiring transferable, structured
thinking abilities, especially when supervised with rigid templates or
crowd-annotated instruction datasets. Unlike prior approaches, we focus on a
thinking-centric data synthesis paradigm that enables models to evolve through
self-generated, cognitively guided data. We propose MindGYM, a structured and
scalable framework for question synthesis, composed of: (1) Cognitive Thinking
Process Injection, which infuses high-level reasoning objectives to shape the
model's synthesis behavior; (2) Seed Single-Hop Question Synthesis, generating
atomic questions from diverse semantic types to encourage broader thinking; and
(3) Challenging Multi-Hop QA Synthesis, composing more complex multi-hop
questions based on QA seeds for deeper reasoning. Detailed analysis shows that
synthetic data generated by our method achieves 16.7% higher average quality
and 67.91% lower quality variance compared to baseline sources, highlighting
that both high-quality and self-contained data are essential for effective,
thinking-oriented fine-tuning. MindGYM improves performance on six reasoning
benchmarks, achieving gains of up to 16% on MathVision using only 400 data
samples, and generalizable improvements across different model sizes and
architectures. MindGYM underscores the viability of self-challenging mechanisms
in refining large model capabilities while minimizing human intervention and
resource demands. Code and data are released to promote data-centric research
into self-evolving foundation models driven by their internal reasoning
capabilities.
comment: Accepted by NeurIPS'25. 30 pages, 2 figures, 13 tables
♻ ☆ D-HUMOR: Dark Humor Understanding via Multimodal Open-ended Reasoning - A Benchmark Dataset and Method ICDM
Sai Kartheek Reddy Kasu, Mohammad Zia Ur Rehman, Shahid Shafi Dar, Rishi Bharat Junghare, Dhanvin Sanjay Namboodiri, Nagendra Kumar
Dark humor in online memes poses unique challenges due to its reliance on
implicit, sensitive, and culturally contextual cues. To address the lack of
resources and methods for detecting dark humor in multimodal content, we
introduce a novel dataset of 4,379 Reddit memes annotated for dark humor,
target category (gender, mental health, violence, race, disability, and other),
and a three-level intensity rating (mild, moderate, severe). Building on this
resource, we propose a reasoning-augmented framework that first generates
structured explanations for each meme using a Large Vision-Language Model
(VLM). Through a Role-Reversal Self-Loop, the VLM adopts the author's perspective
to iteratively refine its explanations, ensuring completeness and alignment. We
then extract textual features from both the OCR transcript and the self-refined
reasoning via a text encoder, while visual features are obtained using a vision
transformer. A Tri-stream Cross-Reasoning Network (TCRNet) fuses these three
streams (text, image, and reasoning) via pairwise attention mechanisms,
producing a unified representation for classification. Experimental results
demonstrate that our approach outperforms strong baselines across three tasks:
dark humor detection, target identification, and intensity prediction. The
dataset, annotations, and code are released to facilitate further research in
multimodal humor understanding and content moderation. Code and Dataset are
available at:
https://github.com/Sai-Kartheek-Reddy/D-Humor-Dark-Humor-Understanding-via-Multimodal-Open-ended-Reasoning
comment: Accepted at IEEE International Conference on Data Mining (ICDM) 2025
♻ ☆ Disentangled 4D Gaussian Splatting: Rendering High-Resolution Dynamic World at 343 FPS
While dynamic novel view synthesis from 2D videos has seen progress,
achieving efficient reconstruction and rendering of dynamic scenes remains a
challenging task. In this paper, we introduce Disentangled 4D Gaussian
Splatting (Disentangled4DGS), a novel representation and rendering pipeline
that achieves real-time performance without compromising visual fidelity.
Disentangled4DGS decouples the temporal and spatial components of 4D Gaussians,
avoiding the slicing step and four-dimensional matrix computations required by
prior methods. By projecting temporal and spatial deformations into dynamic 2D
Gaussians and deferring temporal processing, we minimize redundant computations
of 4DGS. Our approach also features a gradient-guided flow loss and temporal
splitting strategy to reduce artifacts. Experiments demonstrate a significant
improvement in rendering speed and quality, achieving 343 FPS when rendering
1352x1014 images on a single RTX3090 while reducing storage
requirements by at least 4.5%. Our approach sets a new benchmark for dynamic
novel view synthesis, outperforming existing methods on both multi-view and
monocular dynamic scene datasets.
♻ ☆ GRPO-Guard: Mitigating Implicit Over-Optimization in Flow Matching via Regulated Clipping
Jing Wang, Jiajun Liang, Jie Liu, Henglin Liu, Gongye Liu, Jun Zheng, Wanyuan Pang, Ao Ma, Zhenyu Xie, Xintao Wang, Meng Wang, Pengfei Wan, Xiaodan Liang
Recently, GRPO-based reinforcement learning has shown remarkable progress in
optimizing flow-matching models, effectively improving their alignment with
task-specific rewards. Within these frameworks, the policy update relies on
importance-ratio clipping to constrain overconfident positive and negative
gradients. However, in practice, we observe a systematic shift in the
importance-ratio distribution: its mean falls below 1 and its variance differs
substantially across timesteps. This left-shifted and inconsistent distribution
prevents positive-advantage samples from entering the clipped region, causing
the mechanism to fail in constraining overconfident positive updates. As a
result, the policy model inevitably enters an implicit over-optimization
stage: while the proxy reward continues to increase, essential metrics such as
image quality and text-prompt alignment deteriorate sharply, ultimately making
the learned policy impractical for real-world use. To address this issue, we
introduce GRPO-Guard, a simple yet effective enhancement to existing GRPO
frameworks. Our method incorporates ratio normalization, which restores a
balanced and step-consistent importance ratio, ensuring that PPO clipping
properly constrains harmful updates across denoising timesteps. In addition, a
gradient reweighting strategy equalizes policy gradients over noise conditions,
preventing excessive updates from particular timestep regions. Together, these
designs act as a regulated clipping mechanism, stabilizing optimization and
substantially mitigating implicit over-optimization without relying on heavy KL
regularization. Extensive experiments on multiple diffusion backbones (e.g.,
SD3.5M, Flux.1-dev) and diverse proxy tasks demonstrate that GRPO-Guard
significantly reduces over-optimization while maintaining or even improving
generation quality.
comment: Project Page: https://jingw193.github.io/GRPO-Guard/
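A small numpy sketch of the reported fix (a simplification of GRPO-Guard's ratio normalization, not its exact implementation): rescale the importance ratios so their mean is 1 before applying the standard PPO clipped surrogate, so that positive-advantage samples can actually reach the clipped region.

```python
import numpy as np

def clipped_surrogate(ratios, advantages, eps=0.2, normalize=True):
    """PPO-style clipped objective. With normalize=True the ratios are
    rescaled to mean 1 first, mimicking the ratio normalization described
    for GRPO-Guard (a simplified sketch, not the paper's code)."""
    r = np.asarray(ratios, float)
    if normalize:
        r = r / r.mean()                       # restore a mean-1 ratio distribution
    a = np.asarray(advantages, float)
    clipped = np.clip(r, 1.0 - eps, 1.0 + eps)
    return float(np.mean(np.minimum(r * a, clipped * a)))
```

With a left-shifted ratio distribution (mean below 1), unnormalized ratios sit entirely below the clip ceiling, so overconfident positive updates are never constrained; normalization recenters them so clipping engages symmetrically.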
♻ ☆ RRCANet: Recurrent Reusable-Convolution Attention Network for Infrared Small Target Detection
Infrared small target detection is a challenging task due to its unique
characteristics (e.g., small, dim, shapeless and changeable). Recently
published CNN-based methods have achieved promising performance with heavy
feature extraction and fusion modules. To achieve efficient and effective
detection, we propose a recurrent reusable-convolution attention network
(RRCA-Net) for infrared small target detection. Specifically, RRCA-Net
incorporates reusable-convolution block (RuCB) in a recurrent manner without
introducing extra parameters. With the help of the repetitive iteration in
RuCB, the high-level information of small targets in the deep layers can be
well maintained and further refined. Then, a dual interactive attention
aggregation module (DIAAM) is proposed to promote the mutual enhancement and
fusion of refined information. In this way, RRCA-Net can both achieve
high-level feature refinement and enhance the correlation of contextual
information between adjacent layers. Moreover, to achieve steady convergence,
we design a target characteristic inspired loss function (DpT-k loss) by
integrating physical and mathematical constraints. Experimental results on
three benchmark datasets (e.g. NUAA-SIRST, IRSTD-1k, DenseSIRST) demonstrate
that our RRCA-Net can achieve comparable performance to the state-of-the-art
methods while maintaining a small number of parameters, and act as a
plug-and-play module to introduce consistent performance improvements for several popular
IRSTD methods.
comment: We have updated the journal reference and DOI
♻ ☆ Open3D-VQA: A Benchmark for Comprehensive Spatial Reasoning with Multimodal Large Language Model in Open Space
Weichen Zhang, Zile Zhou, Xin Zeng, Xuchen Liu, Jianjie Fang, Chen Gao, Yong Li, Jinqiang Cui, Xinlei Chen, Xiao-Ping Zhang
Spatial reasoning is a fundamental capability of multimodal large language
models (MLLMs), yet their performance in open aerial environments remains
underexplored. In this work, we present Open3D-VQA, a novel benchmark for
evaluating MLLMs' ability to reason about complex spatial relationships from an
aerial perspective. The benchmark comprises 73k QA pairs spanning 7 general
spatial reasoning tasks, including multiple-choice, true/false, and
short-answer formats, and supports both visual and point cloud modalities. The
questions are automatically generated from spatial relations extracted from
both real-world and simulated aerial scenes. Evaluation on 13 popular MLLMs
reveals that: 1) Models are generally better at answering questions about
relative spatial relations than absolute distances, 2) 3D LLMs fail to
demonstrate significant advantages over 2D LLMs, and 3) Fine-tuning solely on
the simulated dataset can significantly improve the model's spatial reasoning
performance in real-world scenarios. We release our benchmark, data generation
pipeline, and evaluation toolkit to support further research:
https://github.com/EmbodiedCity/Open3D-VQA.code.
♻ ☆ FASL-Seg: Anatomy and Tool Segmentation of Surgical Scenes ECAI
Muraam Abdel-Ghani, Mahmoud Ali, Mohamed Ali, Fatmaelzahraa Ahmed, Muhammad Arsalan, Abdulaziz Al-Ali, Shidin Balakrishnan
The growing popularity of robotic minimally invasive surgeries has made deep
learning-based surgical training a key area of research. A thorough
understanding of the surgical scene components is crucial, which semantic
segmentation models can help achieve. However, most existing work focuses on
surgical tools and overlooks anatomical objects. Additionally, current
state-of-the-art (SOTA) models struggle to balance capturing high-level
contextual features and low-level edge features. We propose a Feature-Adaptive
Spatial Localization model (FASL-Seg), designed to capture features at multiple
levels of detail through two distinct processing streams, namely a Low-Level
Feature Projection (LLFP) and a High-Level Feature Projection (HLFP) stream,
for varying feature resolutions - enabling precise segmentation of anatomy and
surgical instruments. We evaluated FASL-Seg on surgical segmentation benchmark
datasets EndoVis18 and EndoVis17 on three use cases. The FASL-Seg model
achieves a mean Intersection over Union (mIoU) of 72.71% on parts and anatomy
segmentation in EndoVis18, improving on SOTA by 5%. It further achieves a mIoU
of 85.61% and 72.78% in EndoVis18 and EndoVis17 tool type segmentation,
respectively, outperforming SOTA overall performance, with comparable per-class
SOTA results in both datasets and consistent performance in various classes for
anatomy and instruments, demonstrating the effectiveness of distinct processing
streams for varying feature resolutions.
comment: 8 pages, 6 figures, In Proceedings of European Conference on
Artificial Intelligence (ECAI) 2025
♻ ☆ Omni-Effects: Unified and Spatially-Controllable Visual Effects Generation
Fangyuan Mao, Aiming Hao, Jintao Chen, Dongxia Liu, Xiaokun Feng, Jiashu Zhu, Meiqi Wu, Chubin Chen, Jiahong Wu, Xiangxiang Chu
Visual effects (VFX) are essential visual enhancements fundamental to modern
cinematic production. Although video generation models offer cost-efficient
solutions for VFX production, current methods are constrained by per-effect
LoRA training, which limits generation to single effects. This fundamental
limitation impedes applications that require spatially controllable composite
effects, i.e., the concurrent generation of multiple effects at designated
locations. However, integrating diverse effects into a unified framework faces
major challenges: interference from effect variations and spatial
uncontrollability during multi-VFX joint training. To tackle these challenges,
we propose Omni-Effects, the first unified framework capable of generating
prompt-guided effects and spatially controllable composite effects. The core of
our framework comprises two key innovations: (1) LoRA-based Mixture of Experts
(LoRA-MoE), which employs a group of expert LoRAs, integrating diverse effects
within a unified model while effectively mitigating cross-task interference.
(2) Spatial-Aware Prompt (SAP) incorporates spatial mask information into the
text token, enabling precise spatial control. Furthermore, we introduce an
Independent-Information Flow (IIF) module integrated within the SAP, isolating
the control signals corresponding to individual effects to prevent any unwanted
blending. To facilitate this research, we construct a comprehensive VFX dataset
Omni-VFX via a novel data collection pipeline combining image editing and
First-Last Frame-to-Video (FLF2V) synthesis, and introduce a dedicated VFX
evaluation framework for validating model performance. Extensive experiments
demonstrate that Omni-Effects achieves precise spatial control and diverse
effect generation, enabling users to specify both the category and location of
desired effects.
♻ ☆ Defending Multimodal Backdoored Models by Repulsive Visual Prompt Tuning
Multimodal contrastive learning models (e.g., CLIP) can learn high-quality
representations from large-scale image-text datasets, while they exhibit
significant vulnerabilities to backdoor attacks, raising serious safety
concerns. In this paper, we reveal that CLIP's vulnerabilities primarily stem
from its tendency to encode features beyond in-dataset predictive patterns,
compromising its visual feature resistivity to input perturbations. This makes
its encoded features highly susceptible to being reshaped by backdoor triggers.
To address this challenge, we propose Repulsive Visual Prompt Tuning (RVPT), a
novel defense approach that employs deep visual prompt tuning with a specially
designed feature-repelling loss. Specifically, RVPT adversarially repels the
encoded features from deeper layers while optimizing the standard cross-entropy
loss, ensuring that only predictive features in downstream tasks are encoded,
thereby enhancing CLIP's visual feature resistivity against input perturbations
and mitigating its susceptibility to backdoor attacks. Unlike existing
multimodal backdoor defense methods that typically require the availability of
poisoned data or involve fine-tuning the entire model, RVPT leverages few-shot
downstream clean samples and only tunes a small number of parameters. Empirical
results demonstrate that RVPT tunes only 0.27\% of the parameters in CLIP, yet
it significantly outperforms state-of-the-art defense methods, reducing the
attack success rate from 89.70\% to 2.76\% against the most advanced multimodal
attacks on ImageNet and effectively generalizes its defensive capabilities
across multiple datasets.
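As a reading aid for the objective described above, here is a minimal numerical sketch of a cross-entropy loss combined with a feature-repelling term. The squared-cosine repulsion, the weighting `lam`, and all function names are illustrative assumptions, not the authors' RVPT implementation.

```python
import numpy as np

def cross_entropy(logits, label):
    # Standard softmax cross-entropy for a single example.
    z = logits - logits.max()
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[label]

def repulsive_loss(feat, ref_feat):
    # Illustrative feature-repelling term: penalize cosine similarity
    # between the tuned deep feature and a frozen reference feature,
    # pushing the encoder away from non-predictive directions.
    cos = feat @ ref_feat / (np.linalg.norm(feat) * np.linalg.norm(ref_feat))
    return cos ** 2

def rvpt_style_objective(logits, label, feat, ref_feat, lam=0.1):
    # Total objective: keep predictive power (CE) while repelling
    # deep-layer features; lam is a hypothetical trade-off weight.
    return cross_entropy(logits, label) + lam * repulsive_loss(feat, ref_feat)

# Orthogonal features incur zero repulsion, so the loss reduces to CE.
loss = rvpt_style_objective(np.array([2.0, 0.5, -1.0]), 0,
                            np.array([1.0, 0.0]), np.array([0.0, 1.0]))
```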
♻ ☆ FARMER: Flow AutoRegressive Transformer over Pixels
Guangting Zheng, Qinyu Zhao, Tao Yang, Fei Xiao, Zhijie Lin, Jie Wu, Jiajun Deng, Yanyong Zhang, Rui Zhu
Directly modeling the explicit likelihood of the raw data distribution is a
key topic in machine learning, one that has driven the scaling successes of
Large Language Models through autoregressive modeling. However, continuous AR
modeling over visual pixel data suffers from extremely long sequences and
high-dimensional spaces. In this paper, we present FARMER, a novel end-to-end
generative framework that unifies Normalizing Flows (NF) and Autoregressive
(AR) models for tractable likelihood estimation and high-quality image
synthesis directly from raw pixels. FARMER employs an invertible autoregressive
flow to transform images into latent sequences, whose distribution is modeled
implicitly by an autoregressive model. To address the redundancy and complexity
in pixel-level modeling, we propose a self-supervised dimension reduction
scheme that partitions NF latent channels into informative and redundant
groups, enabling more effective and efficient AR modeling. Furthermore, we
design a one-step distillation scheme to significantly accelerate inference
speed and introduce a resampling-based classifier-free guidance algorithm to
boost image generation quality. Extensive experiments demonstrate that FARMER
achieves competitive performance compared to existing pixel-based generative
models while providing exact likelihoods and scalable training.
comment: Bytedance Seed Technical Report
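The "exact likelihoods" claim above rests on the standard change-of-variables identity for normalizing flows, here written with the latent sequence factorized autoregressively (generic notation, not the paper's):

```latex
% Exact log-likelihood of an image x under an invertible flow f_theta,
% with latents z = f_theta(x) modeled autoregressively.
\log p_\theta(x) \;=\; \sum_{t} \log p_\theta\!\left(z_t \mid z_{<t}\right)
\;+\; \log \left| \det \frac{\partial f_\theta(x)}{\partial x} \right|
```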
♻ ☆ SPARKE: Scalable Prompt-Aware Diversity and Novelty Guidance in Diffusion Models via RKE Score
Diffusion models have demonstrated remarkable success in high-fidelity image
synthesis and prompt-guided generative modeling. However, ensuring adequate
diversity in generated samples of prompt-guided diffusion models remains a
challenge, particularly when the prompts span a broad semantic spectrum and the
diversity of generated data needs to be evaluated in a prompt-aware fashion
across semantically similar prompts. Recent methods have introduced guidance
via diversity measures to encourage more varied generations. In this work, we
extend the diversity measure-based approaches by proposing the Scalable
Prompt-Aware R\'enyi Kernel Entropy Diversity Guidance (SPARKE) method for
prompt-aware diversity guidance. SPARKE utilizes conditional entropy for
diversity guidance, which dynamically conditions diversity measurement on
similar prompts and enables prompt-aware diversity control. While the
entropy-based guidance approach enhances prompt-aware diversity, its reliance
on the matrix-based entropy scores poses computational challenges in
large-scale generation settings. To address this, we focus on the special case
of Conditional latent RKE Score Guidance, reducing entropy computation and
gradient-based optimization complexity from the $O(n^3)$ of general entropy
measures to $O(n)$. The reduced computational complexity allows for
diversity-guided sampling over potentially thousands of generation rounds on
different prompts. We numerically test the SPARKE method on several
text-to-image diffusion models, demonstrating that the proposed method improves
the prompt-aware diversity of the generated data without incurring significant
computational costs. We release our code on the project page:
https://mjalali.github.io/SPARKE
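For context on the matrix-based entropy mentioned above: the order-2 (RKE-style) kernel entropy can be computed from the Frobenius norm of a trace-normalized kernel matrix, with no eigendecomposition, since the sum of squared eigenvalues equals the squared Frobenius norm. The Gaussian kernel and bandwidth below are generic assumptions; this is not the paper's conditional latent O(n) variant.

```python
import numpy as np

def rke_order2_entropy(X, sigma=1.0):
    # Order-2 Renyi kernel entropy: H2 = -log(sum_i lambda_i^2) of the
    # trace-normalized kernel matrix K/n; sum_i lambda_i^2 = ||K/n||_F^2.
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-sq / (2 * sigma ** 2)) / len(X)  # trace-normalized Gaussian kernel
    return -np.log((K ** 2).sum())

rng = np.random.default_rng(0)
h_spread = rke_order2_entropy(rng.normal(size=(64, 2)) * 5.0)   # diverse samples
h_tight = rke_order2_entropy(rng.normal(size=(64, 2)) * 0.01)   # near-duplicates
```

Diverse samples yield entropy near log(n), while near-duplicate samples collapse it toward zero, which is the behavior a diversity guidance term rewards.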
♻ ☆ Real-Time Neural Video Compression with Unified Intra and Inter Coding
Neural video compression (NVC) technologies have advanced rapidly in recent
years, yielding state-of-the-art schemes such as DCVC-RT that offer superior
compression efficiency to H.266/VVC and real-time encoding/decoding
capabilities. Nonetheless, existing NVC schemes have several limitations,
including inefficiency in dealing with disocclusion and new content, interframe
error propagation and accumulation, among others. To eliminate these
limitations, we borrow the idea from classic video coding schemes, which allow
intra coding within inter-coded frames. With the intra coding tool enabled,
disocclusion and new content are properly handled, and interframe error
propagation is naturally intercepted without the need for manual refresh
mechanisms. We present an NVC framework with unified intra and inter coding,
where every frame is processed by a single model that is trained to perform
intra/inter coding adaptively. Moreover, we propose a simultaneous two-frame
compression design to exploit interframe redundancy not only forwardly but also
backwardly. Experimental results show that our scheme outperforms DCVC-RT by an
average of 12.1% BD-rate reduction, delivers more stable bitrate and quality
per frame, and retains real-time encoding/decoding performances. Code and
models will be released.
comment: 10 pages
♻ ☆ TeleEgo: Benchmarking Egocentric AI Assistants in the Wild
Jiaqi Yan, Ruilong Ren, Jingren Liu, Shuning Xu, Ling Wang, Yiheng Wang, Yun Wang, Long Zhang, Xiangyu Chen, Changzhi Sun, Jixiang Luo, Dell Zhang, Hao Sun, Chi Zhang, Xuelong Li
Egocentric AI assistants in real-world settings must process multi-modal
inputs (video, audio, text), respond in real time, and retain evolving
long-term memory. However, existing benchmarks typically evaluate these
abilities in isolation, lack realistic streaming scenarios, or support only
short-term tasks. We introduce \textbf{TeleEgo}, a long-duration, streaming,
omni-modal benchmark for evaluating egocentric AI assistants in realistic daily
contexts. The dataset features over 14 hours per participant of synchronized
egocentric video, audio, and text across four domains: work \& study, lifestyle
\& routines, social activities, and outings \& culture. All data is aligned on
a unified global timeline and includes high-quality visual narrations and
speech transcripts, curated through human refinement. TeleEgo defines 12
diagnostic subtasks across three core capabilities: Memory (recalling past
events), Understanding (interpreting the current moment), and Cross-Memory
Reasoning (linking distant events). It contains 3,291 human-verified QA items
spanning multiple question formats (single-choice, binary, multi-choice, and
open-ended), evaluated strictly in a streaming setting. We propose two key
metrics -- Real-Time Accuracy and Memory Persistence Time -- to jointly assess
correctness, temporal responsiveness, and long-term retention. TeleEgo provides
a realistic and comprehensive evaluation to advance the development of
practical AI assistants.
♻ ☆ DOVE: Efficient One-Step Diffusion Model for Real-World Video Super-Resolution NeurIPS 2025
Diffusion models have demonstrated promising performance in real-world video
super-resolution (VSR). However, the dozens of sampling steps they require,
make inference extremely slow. Sampling acceleration techniques, particularly
single-step, provide a potential solution. Nonetheless, achieving one step in
VSR remains challenging, due to the high training overhead on video data and
stringent fidelity demands. To tackle the above issues, we propose DOVE, an
efficient one-step diffusion model for real-world VSR. DOVE is obtained by
fine-tuning a pretrained video diffusion model (i.e., CogVideoX). To
effectively train DOVE, we introduce the latent-pixel training strategy. The
strategy employs a two-stage scheme to gradually adapt the model to the video
super-resolution task. Meanwhile, we design a video processing pipeline to
construct a high-quality dataset tailored for VSR, termed HQ-VSR. Fine-tuning
on this dataset further enhances the restoration capability of DOVE. Extensive
experiments show that DOVE exhibits comparable or superior performance to
multi-step diffusion-based VSR methods. It also offers outstanding inference
efficiency, achieving up to a 28$\times$ speed-up over existing methods such as
MGLD-VSR. Code is available at: https://github.com/zhengchen1999/DOVE.
comment: Accepted to NeurIPS 2025. Code is available at:
https://github.com/zhengchen1999/DOVE
♻ ☆ Language-guided Open-world Video Anomaly Detection under Weak Supervision
Video anomaly detection (VAD) aims to detect anomalies that deviate from what
is expected. In open-world scenarios, the expected events may change as
requirements change. For example, not wearing a mask may be considered abnormal
during a flu outbreak but normal otherwise. However, existing methods assume
that the definition of anomalies is invariable, and thus are not applicable to
the open world. To address this, we propose a novel open-world VAD paradigm
with variable definitions, allowing guided detection through user-provided
natural language at inference time. This paradigm necessitates establishing a
robust mapping from video and textual definition to anomaly scores. Therefore,
we propose LaGoVAD (Language-guided Open-world Video Anomaly Detector), a model
that dynamically adapts anomaly definitions under weak supervision with two
regularization strategies: diversifying the relative durations of anomalies via
dynamic video synthesis, and enhancing feature robustness through contrastive
learning with negative mining. Training such adaptable models requires diverse
anomaly definitions, but existing datasets typically provide labels without
semantic descriptions. To bridge this gap, we collect PreVAD (Pre-training
Video Anomaly Dataset), the largest and most diverse video anomaly dataset to
date, featuring 35,279 annotated videos with multi-level category labels and
descriptions that explicitly define anomalies. Zero-shot experiments on seven
datasets demonstrate LaGoVAD's SOTA performance. Our dataset and code will be
released at https://github.com/Kamino666/LaGoVAD-PreVAD.
♻ ☆ Buffer layers for Test-Time Adaptation NeurIPS 2025
In recent advancements in Test Time Adaptation (TTA), most existing
methodologies focus on updating normalization layers to adapt to the test
domain. However, the reliance on normalization-based adaptation presents key
challenges. First, normalization layers such as Batch Normalization (BN) are
highly sensitive to small batch sizes, leading to unstable and inaccurate
statistics. Moreover, normalization-based adaptation is inherently constrained
by the structure of the pre-trained model, as it relies on training-time
statistics that may not generalize well to unseen domains. These issues limit
the effectiveness of normalization-based TTA approaches, especially under
significant domain shift. In this paper, we introduce a novel paradigm based on
the concept of a Buffer layer, which addresses the fundamental limitations of
normalization layer updates. Unlike existing methods that modify the core
parameters of the model, our approach preserves the integrity of the
pre-trained backbone, inherently mitigating the risk of catastrophic forgetting
during online adaptation. Through comprehensive experimentation, we demonstrate
that our approach not only outperforms traditional methods in mitigating domain
shift and enhancing model robustness, but also exhibits strong resilience to
forgetting. Furthermore, our Buffer layer is modular and can be seamlessly
integrated into nearly all existing TTA frameworks, resulting in consistent
performance improvements across various architectures. These findings validate
the effectiveness and versatility of the proposed solution in real-world domain
adaptation scenarios. The code is available at
https://github.com/hyeongyu-kim/Buffer_TTA.
comment: Accepted at NeurIPS 2025
♻ ☆ Towards Predicting Any Human Trajectory In Context NeurIPS 2025
Predicting accurate future trajectories of pedestrians is essential for
autonomous systems but remains a challenging task due to the need for
adaptability in different environments and domains. A common approach involves
collecting scenario-specific data and performing fine-tuning via
backpropagation. However, the need to fine-tune for each new scenario is often
impractical for deployment on edge devices. To address this challenge, we
introduce TrajICL, an In-Context Learning (ICL) framework for pedestrian
trajectory prediction that adapts to scenario-specific data at inference time
without fine-tuning or weight updates. We
propose a spatio-temporal similarity-based example selection (STES) method that
selects relevant examples from previously observed trajectories within the same
scene by identifying similar motion patterns at corresponding locations. To
further refine this selection, we introduce prediction-guided example selection
(PG-ES), which selects examples based on both the past trajectory and the
predicted future trajectory, rather than relying solely on the past trajectory.
This approach allows the model to account for long-term dynamics when selecting
examples. Finally, instead of relying on small real-world datasets with limited
scenario diversity, we train our model on a large-scale synthetic dataset to
enhance its prediction ability by leveraging in-context examples. Extensive
experiments demonstrate that TrajICL achieves remarkable adaptation across both
in-domain and cross-domain scenarios, outperforming even fine-tuned approaches
across multiple public benchmarks. Project Page:
https://fujiry0.github.io/TrajICL-project-page/.
comment: NeurIPS 2025
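The spatio-temporal similarity-based selection described above can be illustrated with a toy nearest-example lookup over past motion patterns. The displacement-plus-location distance and all names below are assumptions for illustration, not the paper's STES definition.

```python
import numpy as np

def select_examples(query_past, bank, k=2):
    # Toy spatio-temporal example selection: compare per-step displacement
    # patterns (motion) plus the current location, and return indices of
    # the k most similar stored trajectories.
    def features(traj):
        motion = np.diff(traj, axis=0).ravel()      # per-step displacements
        return np.concatenate([motion, traj[-1]])   # motion + location
    q = features(query_past)
    dists = [np.linalg.norm(features(t) - q) for t in bank]
    return np.argsort(dists)[:k]

bank = [np.array([[0, 0], [1, 0], [2, 0]], float),   # walking right
        np.array([[0, 0], [0, 1], [0, 2]], float),   # walking up
        np.array([[5, 5], [6, 5], [7, 5]], float)]   # right, but far away
query = np.array([[0.1, 0], [1.1, 0], [2.1, 0]], float)
idx = select_examples(query, bank, k=1)  # nearest by motion and location
```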
♻ ☆ SPLite Hand: Sparsity-Aware Lightweight 3D Hand Pose Estimation
With the increasing ubiquity of AR/VR devices, the deployment of deep
learning models on edge devices has become a critical challenge. These devices
require real-time inference, low power consumption, and minimal latency. Many
framework designers face the conundrum of balancing efficiency and performance.
We design a lightweight framework that adopts an encoder-decoder architecture and
introduces several key contributions aimed at improving both efficiency and
accuracy. We apply sparse convolution on a ResNet-18 backbone to exploit the
inherent sparsity in hand pose images, achieving a 42% end-to-end efficiency
improvement. Moreover, we propose our SPLite decoder. This new architecture
significantly boosts the decoding process's frame rate by 3.1x on the Raspberry
Pi 5, while maintaining on-par accuracy. To further optimize performance, we
apply quantization-aware training, reducing memory usage while preserving
accuracy (PA-MPJPE increases only marginally from 9.0 mm to 9.1 mm on
FreiHAND). Overall, our system achieves a 2.98x speed-up on a Raspberry Pi 5
CPU (BCM2712 quad-core Arm A76 processor). Our method is also evaluated on
compound benchmark datasets, demonstrating comparable accuracy to
state-of-the-art approaches while significantly enhancing computational
efficiency.
comment: Accepted to AICCC 2025
♻ ☆ From One to More: Contextual Part Latents for 3D Generation
Shaocong Dong, Lihe Ding, Xiao Chen, Yaokun Li, Yuxin Wang, Yucheng Wang, Qi Wang, Jaehyeok Kim, Chenjian Gao, Zhanpeng Huang, Zibin Wang, Tianfan Xue, Dan Xu
Recent advances in 3D generation have transitioned from multi-view 2D
rendering approaches to 3D-native latent diffusion frameworks that exploit
geometric priors in ground truth data. Despite progress, three key limitations
persist: (1) Single-latent representations fail to capture complex multi-part
geometries, causing detail degradation; (2) Holistic latent coding neglects
part independence and interrelationships critical for compositional design; (3)
Global conditioning mechanisms lack fine-grained controllability. Inspired by
human 3D design workflows, we propose CoPart - a part-aware diffusion framework
that decomposes 3D objects into contextual part latents for coherent multi-part
generation. This paradigm offers three advantages: i) Reduces encoding
complexity through part decomposition; ii) Enables explicit part relationship
modeling; iii) Supports part-level conditioning. We further develop a mutual
guidance strategy to fine-tune pre-trained diffusion models for joint part
latent denoising, ensuring both geometric coherence and foundation model
priors. To enable large-scale training, we construct Partverse - a novel 3D
part dataset derived from Objaverse through automated mesh segmentation and
human-verified annotations. Extensive experiments demonstrate CoPart's superior
capabilities in part-level editing, articulated object generation, and scene
composition with unprecedented controllability.
comment: Project page: https://copart3d.github.io/
♻ ☆ Cycle Diffusion Model for Counterfactual Image Generation
Fangrui Huang, Alan Wang, Binxu Li, Bailey Trang, Ridvan Yesiloglu, Tianyu Hua, Wei Peng, Ehsan Adeli
Deep generative models have demonstrated remarkable success in medical image
synthesis. However, ensuring conditioning faithfulness and high-quality
synthetic images for direct or counterfactual generation remains a challenge.
In this work, we introduce a cycle training framework to fine-tune diffusion
models for improved conditioning adherence and enhanced synthetic image
realism. Our approach, Cycle Diffusion Model (CDM), enforces consistency
between generated and original images by incorporating cycle constraints,
enabling more reliable direct and counterfactual generation. Experiments on a
combined 3D brain MRI dataset (from ABCD, HCP aging & young adults, ADNI, and
PPMI) show that our method improves conditioning accuracy and enhances image
quality as measured by FID and SSIM. The results suggest that the cycle
strategy used in CDM can be an effective method for refining diffusion-based
medical image generation, with applications in data augmentation,
counterfactual, and disease progression modeling.
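The cycle constraint described above can be caricatured with a tiny stand-in generator: map an image forward under a target condition, map the result back under the original condition, and penalize the reconstruction gap. The linear "generator" and its strength parameter are purely illustrative, not a diffusion sampler.

```python
import numpy as np

def toy_generator(x, cond, strength=0.3):
    # Stand-in for a conditional generator: nudges the image toward the
    # conditioning value (a real model would be a diffusion sampler).
    return x + strength * (cond - x)

def cycle_loss(x, cond_orig, cond_target):
    # Cycle constraint: x -> counterfactual (target cond) -> back (orig
    # cond); the round trip should land near the original image.
    x_cf = toy_generator(x, cond_target)
    x_rec = toy_generator(x_cf, cond_orig)
    return float(np.mean((x - x_rec) ** 2))
```

With a generator already consistent with its conditioning, the cycle loss vanishes; otherwise it supplies a training signal for conditioning adherence.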
♻ ☆ Empowering Agentic Video Analytics Systems with Video Language Models
AI-driven video analytics has become increasingly important across diverse
domains. However, existing systems are often constrained to specific,
predefined tasks, limiting their adaptability in open-ended analytical
scenarios. The recent emergence of Vision Language Models (VLMs) as
transformative technologies offers significant potential for enabling
open-ended video understanding, reasoning, and analytics. Nevertheless, their
limited context windows present challenges when processing ultra-long video
content, which is prevalent in real-world applications. To address this, we
introduce AVA, a VLM-powered system designed for open-ended, advanced video
analytics. AVA incorporates two key innovations: (1) the near real-time
construction of Event Knowledge Graphs (EKGs) for efficient indexing of long or
continuous video streams, and (2) an agentic retrieval-generation mechanism
that leverages EKGs to handle complex and diverse queries. Comprehensive
evaluations on public benchmarks, LVBench and VideoMME-Long, demonstrate that
AVA achieves state-of-the-art performance, attaining 62.3% and 64.1% accuracy,
respectively-significantly surpassing existing VLM and video
Retrieval-Augmented Generation (RAG) systems. Furthermore, to evaluate video
analytics in ultra-long and open-world video scenarios, we introduce a new
benchmark, AVA-100. This benchmark comprises 8 videos, each exceeding 10 hours
in duration, along with 120 manually annotated, diverse, and complex
question-answer pairs. On AVA-100, AVA achieves top-tier performance with an
accuracy of 75.8%. The source code of AVA is available at
https://github.com/I-ESC/Project-Ava. The AVA-100 benchmark can be accessed at
https://huggingface.co/datasets/iesc/Ava-100.
comment: Accepted to NSDI 2026, 19 pages, 12 figures, supplementary evaluations
and appendix
♻ ☆ Unleashing Diffusion Transformers for Visual Correspondence by Modulating Massive Activations NeurIPS 2025
Pre-trained stable diffusion models (SD) have shown great advances in visual
correspondence. In this paper, we investigate the capabilities of Diffusion
Transformers (DiTs) for accurate dense correspondence. Distinct from SD, DiTs
exhibit a critical phenomenon in which very few feature activations take on
significantly larger values than others, known as \textit{massive activations},
leading to uninformative representations and significant performance
degradation for DiTs. The massive activations consistently concentrate at very
few fixed dimensions across all image patch tokens, holding little local
information. We trace these dimension-concentrated massive activations and find
that such concentration can be effectively localized by the zero-initialized
Adaptive Layer Norm (AdaLN-zero). Building on these findings, we propose
Diffusion Transformer Feature (DiTF), a training-free framework designed to
extract semantic-discriminative features from DiTs. Specifically, DiTF employs
AdaLN to adaptively localize and normalize massive activations with
channel-wise modulation. In addition, we develop a channel discard strategy to
further eliminate the negative impacts from massive activations. Experimental
results demonstrate that our DiTF outperforms both DINO and SD-based models and
establishes a new state-of-the-art performance for DiTs in different visual
correspondence tasks (e.g., +9.4\% on Spair-71k and +4.4\% on AP-10K-C.S.).
comment: NeurIPS 2025
♻ ☆ MMEdge: Accelerating On-device Multimodal Inference via Pipelined Sensing and Encoding
Real-time multimodal inference on resource-constrained edge devices is
essential for applications such as autonomous driving, human-computer
interaction, and mobile health. However, prior work often overlooks the tight
coupling between sensing dynamics and model execution, as well as the complex
inter-modality dependencies. In this paper, we propose MMEdge, a new on-device
multi-modal inference framework based on pipelined sensing and encoding.
Instead of waiting for complete sensor inputs, MMEdge decomposes the entire
inference process into a sequence of fine-grained sensing and encoding units,
allowing computation to proceed incrementally as data arrive. MMEdge also
introduces a lightweight but effective temporal aggregation module that
captures rich temporal dynamics across different pipelined units to maintain
accuracy. This pipelined design also opens up opportunities for
fine-grained cross-modal optimization and early decision-making during
inference. To further enhance system performance under resource variability and
input data complexity, MMEdge incorporates an adaptive multimodal configuration
optimizer that dynamically selects optimal sensing and model configurations for
each modality under latency constraints, and a cross-modal speculative skipping
mechanism that bypasses future units of slower modalities when early
predictions reach sufficient confidence. We evaluate MMEdge using two public
multimodal datasets and deploy it on a real-world unmanned aerial vehicle
(UAV)-based multimodal testbed. The results show that MMEdge significantly
reduces end-to-end latency while maintaining high task accuracy across various
system and data dynamics.
comment: Code available at: https://github.com/HKUST-MINSys-Lab/MMEdge.
Accepted by SenSys 2026
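The pipelined sensing-and-encoding idea above, where computation proceeds incrementally as chunks of sensor data arrive rather than waiting for the full input, can be sketched with a simple streaming loop. The unit size, the running-mean "encoder", and the mean aggregation are placeholders, not MMEdge's modules.

```python
def sense(stream, unit_size):
    # Yield fine-grained sensing units as soon as enough samples arrive.
    for i in range(0, len(stream), unit_size):
        yield stream[i:i + unit_size]

def pipelined_encode(stream, unit_size=4):
    # Encode each unit immediately (placeholder encoder: unit mean) and
    # aggregate across units, instead of encoding the full input at once.
    encoded = []
    for unit in sense(stream, unit_size):
        encoded.append(sum(unit) / len(unit))    # per-unit encoding
    return sum(encoded) / len(encoded)           # temporal aggregation

result = pipelined_encode(list(range(16)), unit_size=4)
```

Because each unit is encoded as soon as it is sensed, per-unit latency overlaps with ongoing sensing, which is the source of the end-to-end latency reduction the abstract reports.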
♻ ☆ Boosting Generative Adversarial Transferability with Self-supervised Vision Transformer Features ICCV 2025
The ability of deep neural networks (DNNs) comes from extracting and
interpreting features from the data provided. By exploiting intermediate
features in DNNs instead of relying on hard labels, we craft adversarial
perturbations that generalize more effectively, boosting black-box
transferability. In previous work, these features have ubiquitously come from
supervised learning. Inspired by the exceptional synergy between self-supervised
learning and the Transformer architecture, this paper explores whether
exploiting self-supervised Vision Transformer (ViT) representations can improve
adversarial transferability. We present dSVA -- a generative dual
self-supervised ViT features attack, that exploits both global structural
features from contrastive learning (CL) and local textural features from masked
image modeling (MIM), the self-supervised learning paradigm duo for ViTs. We
design a novel generative training framework that incorporates a generator to
create black-box adversarial examples, and strategies to train the generator by
exploiting joint features and the attention mechanism of self-supervised ViTs.
Our findings show that CL and MIM enable ViTs to attend to distinct feature
tendencies, which, when exploited in tandem, yield strong adversarial
generalizability. By disrupting dual deep features distilled by self-supervised
ViTs, we achieve remarkable black-box transferability to models of various
architectures, outperforming state-of-the-art methods. Code available at
https://github.com/spencerwooo/dSVA.
comment: 14 pages, 9 figures, accepted at ICCV 2025
♻ ☆ Signal-SGN: A Spiking Graph Convolutional Network for Skeletal Action Recognition via Learning Temporal-Frequency Dynamics
Graph Convolutional Networks (GCNs) are effective models for multimodal
skeleton-based action recognition, but their reliance on floating-point
computations leads to high energy consumption, limiting their applicability in
battery-powered devices. While energy-efficient, Spiking Neural Networks (SNNs)
struggle to model skeleton dynamics, leading to suboptimal solutions. We
propose Signal-SGN (Spiking Graph Convolutional Network), which utilizes the
temporal dimension of skeleton sequences as the spike time steps and represents
features as multi-dimensional discrete stochastic signals for
temporal-frequency domain feature extraction. It combines the 1D Spiking Graph
Convolution (1D-SGC) module and the Frequency Spiking Convolution (FSC) module
to extract features from skeletons represented in spiking form.
Additionally, the Multi-Scale Wavelet Transform Feature Fusion (MWTF) module is
proposed to extract dynamic spiking features and capture frequency-specific
characteristics, enhancing classification performance. Experiments across three
large-scale datasets reveal Signal-SGN exceeding state-of-the-art SNN-based
methods in accuracy and computational efficiency while attaining comparable
performance with GCN methods and significantly reducing theoretical energy
consumption.
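The core idea of using skeleton frames as spike time steps can be sketched with a leaky integrate-and-fire (LIF) neuron on top of a graph convolution. This is an illustrative toy, not the 1D-SGC/FSC/MWTF modules themselves; shapes, constants, and the chain adjacency are assumptions.

```python
import torch

def lif_spikes(currents, tau=0.5, v_th=1.0):
    """Convert per-frame input currents (T, J, C) into binary spike trains."""
    v = torch.zeros_like(currents[0])
    spikes = []
    for i_t in currents:                 # skeleton frames double as spike steps
        v = tau * v + i_t                # leaky membrane integration
        s = (v >= v_th).float()          # fire when the threshold is crossed
        v = v * (1.0 - s)                # hard reset after a spike
        spikes.append(s)
    return torch.stack(spikes)

def graph_conv(x, adj, weight):
    """Aggregate joint features over the skeleton graph: (T, J, C) -> (T, J, C')."""
    return torch.einsum("jk,tkc,cd->tjd", adj, x, weight)

T, J, C = 8, 5, 4                         # frames, joints, channels (toy sizes)
x = torch.rand(T, J, C)
adj = torch.eye(J) + torch.diag(torch.ones(J - 1), 1)  # chain-like skeleton
adj = adj / adj.sum(1, keepdim=True)      # row-normalize
w = torch.rand(C, C)

spikes = lif_spikes(graph_conv(x, adj, w))
```

The resulting binary spike tensor is what makes event-driven, low-energy inference possible; the paper's frequency-domain modules then operate on such spiking features.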
♻ ☆ DiffVLA++: Bridging Cognitive Reasoning and End-to-End Driving through Metric-Guided Alignment
Yu Gao, Anqing Jiang, Yiru Wang, Wang Jijun, Hao Jiang, Zhigang Sun, Heng Yuwen, Wang Shuo, Hao Zhao, Sun Hao
Conventional end-to-end (E2E) driving models are effective at generating
physically plausible trajectories, but often fail to generalize to long-tail
scenarios due to the lack of essential world knowledge to understand and reason
about surrounding environments. In contrast, Vision-Language-Action (VLA)
models leverage world knowledge to handle challenging cases, but their limited
3D reasoning capability can lead to physically infeasible actions. In this work,
we introduce DiffVLA++, an enhanced autonomous driving framework that
explicitly bridges cognitive reasoning and E2E planning through metric-guided
alignment. First, we build a VLA module directly generating semantically
grounded driving trajectories. Second, we design an E2E module with a dense
trajectory vocabulary that ensures physical feasibility. Third, and most
critically, we introduce a metric-guided trajectory scorer that guides and
aligns the outputs of the VLA and E2E modules, thereby integrating their
complementary strengths. Experiments on the ICCV 2025 Autonomous Grand
Challenge leaderboard show that DiffVLA++ achieves an EPDMS of 49.12.
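A metric-guided scorer that selects among candidate trajectories from both modules might look like the following hypothetical sketch. The metrics, weights, and trajectory shapes are illustrative assumptions; they are not the EPDMS metric or the paper's scorer.

```python
import numpy as np

def score_trajectory(traj, obstacle, w_progress=1.0, w_safety=2.0, w_comfort=0.5):
    """Score a (T, 2) array of future x,y positions with simple driving metrics."""
    progress = traj[-1, 0] - traj[0, 0]                        # forward distance
    clearance = np.linalg.norm(traj - obstacle, axis=1).min()  # nearest-obstacle gap
    jerk = np.abs(np.diff(traj, n=3, axis=0)).sum()            # rough comfort proxy
    return w_progress * progress + w_safety * clearance - w_comfort * jerk

def align(vla_trajs, e2e_trajs, obstacle):
    """Pool candidates from both modules and return the best-scoring one."""
    candidates = list(vla_trajs) + list(e2e_trajs)
    scores = [score_trajectory(t, obstacle) for t in candidates]
    return candidates[int(np.argmax(scores))]

t = np.linspace(0, 1, 10)[:, None]
straight = np.hstack([10 * t, np.zeros_like(t)])  # e2e candidate: straight ahead
swerve = np.hstack([10 * t, 2 * t])               # vla candidate: nudges left
best = align([swerve], [straight], obstacle=np.array([5.0, 0.0]))
```

With an obstacle directly ahead, the scorer prefers the swerving candidate: this is the sense in which a shared metric "aligns" semantically grounded and physically feasible proposals.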
♻ ☆ ChartMuseum: Testing Visual Reasoning Capabilities of Large Vision-Language Models NeurIPS 2025
Liyan Tang, Grace Kim, Xinyu Zhao, Thom Lake, Wenxuan Ding, Fangcong Yin, Prasann Singhal, Manya Wadhwa, Zeyu Leo Liu, Zayne Sprague, Ramya Namuduri, Bodun Hu, Juan Diego Rodriguez, Puyuan Peng, Greg Durrett
Chart understanding presents a unique challenge for large vision-language
models (LVLMs), as it requires the integration of sophisticated textual and
visual reasoning capabilities. However, current LVLMs exhibit a notable
imbalance between these skills, falling short on visual reasoning that is
difficult to perform in text. We conduct a case study using a synthetic dataset
solvable only through visual reasoning and show that model performance degrades
significantly with increasing visual complexity, while human performance
remains robust. We then introduce ChartMuseum, a new Chart Question Answering
(QA) benchmark containing 1,162 expert-annotated questions spanning multiple
reasoning types, curated from real-world charts across 184 sources,
specifically built to evaluate complex visual and textual reasoning. Unlike
prior chart understanding benchmarks -- where frontier models perform similarly
and near saturation -- our benchmark exposes a substantial gap between model
and human performance, while effectively differentiating model capabilities:
although humans achieve 93% accuracy, the best-performing model Gemini-2.5-Pro
attains only 63.0%, and the leading open-source LVLM Qwen2.5-VL-72B-Instruct
achieves only 38.5%. Moreover, on questions requiring primarily visual
reasoning, all models experience a 35%-55% performance drop from
text-reasoning-heavy question performance. Lastly, our qualitative error
analysis reveals specific categories of visual reasoning that are challenging
for current LVLMs.
comment: NeurIPS 2025 Datasets & Benchmarks
♻ ☆ Neighborhood Feature Pooling for Remote Sensing Image Classification
Fahimeh Orvati Nia, Amirmohammad Mohammadi, Salim Al Kharsa, Pragati Naikare, Zigfried Hampel-Arias, Joshua Peeples
In this work, we propose neighborhood feature pooling (NFP) as a novel
texture feature extraction method for remote sensing image classification. The
NFP layer captures relationships between neighboring inputs and efficiently
aggregates local similarities across feature dimensions. Implemented using
convolutional layers, NFP can be seamlessly integrated into any network.
Results comparing the baseline models and the NFP method indicate that NFP
consistently improves performance across diverse datasets and architectures
while maintaining minimal parameter overhead.
comment: 9 pages, 5 figures
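The neighborhood-similarity idea can be sketched with an unfold-based layer: for each spatial location, compare the center feature vector with its 3x3 neighbors and emit the similarity map. This is a minimal sketch under assumed shapes, not the exact NFP layer.

```python
import torch
import torch.nn.functional as F

def neighborhood_feature_pool(x, k=3):
    """x: (B, C, H, W) -> (B, k*k, H, W) map of center-neighbor cosine similarities."""
    b, c, h, w = x.shape
    pad = k // 2
    patches = F.unfold(x, k, padding=pad)              # (B, C*k*k, H*W)
    patches = patches.view(b, c, k * k, h, w)          # neighbors per location
    center = x.unsqueeze(2)                            # (B, C, 1, H, W)
    return F.cosine_similarity(patches, center, dim=1) # (B, k*k, H, W)

x = torch.rand(2, 8, 16, 16)
feat = neighborhood_feature_pool(x)
```

Because the layer is built from standard unfold/convolution primitives and adds no learned parameters in this form, it can be dropped into an existing backbone with minimal overhead, matching the paper's claim of seamless integration.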
♻ ☆ GCVAMD: A Modified CausalVAE Model for Causal Age-related Macular Degeneration Risk Factor Detection and Prediction
Age-related Macular Degeneration (AMD) is one of the leading causes of
permanent vision impairment in ophthalmology. Although treatments such as
anti-VEGF drugs and photodynamic therapy can slow the degenerative process of
AMD, there is still no cure that reverses the vision loss it causes. Detecting
AMD or its risk factors in the patient's retina at an early stage is therefore
crucial to reducing the possibility of vision impairment. Apart from traditional approaches,
deep learning based methods, especially attention mechanism based CNNs and
GradCAM based XAI analysis on OCT scans, exhibited successful performance in
distinguishing AMD retinas from normal ones, making it possible to use AI
driven models to aid medical diagnosis and analysis by ophthalmologists
regarding AMD. However, despite this success, previous works have mostly
focused on prediction performance itself, not the pathologies or underlying
causal mechanisms of AMD, which can preclude intervention analysis on specific
factors or even lead to less reliable decisions. Thus, this paper introduces a
novel causal AMD analysis model: GCVAMD, which incorporates a modified
CausalVAE approach that can extract latent causal factors from only raw OCT
images. By considering causality in AMD detection, GCVAMD enables causal
inference such as treatment simulation or intervention analysis regarding major
risk factors: drusen and neovascularization, while returning informative latent
causal features that can enhance downstream tasks. Results show that through
GCVAMD, drusen status and neovascularization status can be identified with AMD
causal mechanisms in GCVAMD latent spaces, which can in turn be used for
various tasks, from AMD detection (classification) to intervention analysis.
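The kind of intervention this enables can be illustrated with a toy causal latent layer: latents follow z_i = eps_i + sum_j A[j,i] z_j over an acyclic graph, and a do-operation fixes one factor before decoding. Everything here (four factors, random adjacency, factor index 1 standing for "drusen") is an illustrative assumption, not GCVAMD's architecture.

```python
import torch

def causal_layer(eps, A, do=None):
    """Map exogenous noise eps (..., d) to causal latents z over an acyclic graph.
    A is strictly upper-triangular; `do` maps a factor index to a fixed value."""
    z = eps.clone()
    for i in range(z.shape[-1]):              # columns in topological order
        z[..., i] = eps[..., i] + (z * A[:, i]).sum(-1)
        if do is not None and i in do:
            z[..., i] = do[i]                 # sever incoming edges: do(z_i := v)
    return z

A = torch.triu(torch.rand(4, 4), diagonal=1)  # 4 latent factors, acyclic
eps = torch.rand(8, 4)
z_obs = causal_layer(eps, A)                  # observational latents
z_int = causal_layer(eps, A, do={1: 0.0})     # e.g. do(drusen := 0)
```

Factors downstream of the intervened one change while upstream factors do not; decoding both latent sets would yield the kind of treatment simulation the abstract describes.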
♻ ☆ GameFactory: Creating New Games with Generative Interactive Videos ICCV 2025
Generative videos have the potential to revolutionize game development by
autonomously creating new content. In this paper, we present GameFactory, a
framework for action-controlled scene-generalizable game video generation. We
first address the fundamental challenge of action controllability by
introducing GF-Minecraft, an action-annotated game video dataset without human
bias, and developing an action control module that enables precise control over
both keyboard and mouse inputs. We further extend to support autoregressive
generation for unlimited-length interactive videos. More importantly,
GameFactory tackles the critical challenge of scene-generalizable action
control, which most existing methods fail to address. To enable the creation of
entirely new and diverse games beyond fixed styles and scenes, we leverage the
open-domain generative priors from pre-trained video diffusion models. To
bridge the domain gap between open-domain priors and small-scale game datasets,
we propose a multi-phase training strategy with a domain adapter that decouples
game style learning from action control. This decoupling ensures that action
control learning is no longer bound to specific game styles, thereby achieving
scene-generalizable action control. Experimental results demonstrate that
GameFactory effectively generates open-domain action-controllable game videos,
representing a significant step forward in AI-driven game generation.
comment: ICCV 2025 Highlight, Project Page:
https://yujiwen.github.io/gamefactory
♻ ☆ Reasoning Visual Language Model for Chest X-Ray Analysis
Andriy Myronenko, Dong Yang, Baris Turkbey, Mariam Aboian, Sena Azamat, Esra Akcicek, Hongxu Yin, Pavlo Molchanov, Marc Edgar, Yufan He, Pengfei Guo, Yucheng Tang, Daguang Xu
Vision-language models (VLMs) have shown strong promise for medical image
analysis, but most remain opaque, offering predictions without the transparent,
stepwise reasoning clinicians rely on. We present a framework that brings
chain-of-thought (CoT) reasoning to chest X-ray interpretation. Inspired by
reasoning-first training paradigms, our approach is designed to learn how
experts reason, not just what they conclude, by aligning intermediate steps
with observable image evidence and radiology workflow. Beyond accuracy, the
explicit reasoning traces support clinical auditability: they reveal why a
conclusion was reached, which alternatives were considered, and where
uncertainty remains, enabling quality assurance, error analysis, and safer
human-AI collaboration.
Our model couples high-fidelity visual encoding with a two-stage training
recipe: a reasoning-style supervised fine-tuning (SFT) followed by
reinforcement learning (RL) that uses verifiable rewards over a list of X-ray
abnormalities. The model outputs reasoning that mirrors radiologists'
systematic thought process, uncertainty, and differential diagnosis. In
out-of-distribution evaluation, the approach achieves competitive multi-label
classification while improving interpretability. In a reader study with expert
radiologists, full reasoning traces increased confidence, supported error
auditing, and reduced time to finalize reports. We release code and the model
NV-Reason-CXR-3B to support community progress toward trustworthy, explainable
AI in chest radiography and other medical imaging tasks where reasoning quality
is as critical as prediction quality.
comment: NV-Reason-CXR-3B
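A verifiable reward over a list of abnormalities can be sketched as follows: parse the model's free-text conclusion for a fixed abnormality vocabulary and score it against reference labels with an F1-style reward. The vocabulary and the crude negation rule are assumptions, not the paper's exact recipe.

```python
ABNORMALITIES = ["cardiomegaly", "pleural effusion", "pneumothorax", "edema"]

def extract_findings(text):
    """An abnormality counts as predicted if mentioned without a nearby negation."""
    text = text.lower()
    found = set()
    for label in ABNORMALITIES:
        if label in text and "no " + label not in text and "without " + label not in text:
            found.add(label)
    return found

def verifiable_reward(generated_text, reference_labels):
    """F1 between findings parsed from the generation and the reference label set."""
    pred, ref = extract_findings(generated_text), set(reference_labels)
    if not pred and not ref:
        return 1.0
    tp = len(pred & ref)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(ref) if ref else 0.0
    return 0.0 if tp == 0 else 2 * precision * recall / (precision + recall)

r = verifiable_reward(
    "Findings suggest cardiomegaly and a small pleural effusion; no pneumothorax.",
    ["cardiomegaly", "pleural effusion"],
)
```

Because the reward is computed from checkable labels rather than a learned judge, the RL stage optimizes against a signal that cannot be gamed by fluent but unsupported text.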
♻ ☆ VerifIoU -- Robustness of Object Detection to Perturbations SC
Noémie Cohen, Mélanie Ducoffe, Ryma Boumazouza, Christophe Gabreau, Claire Pagetti, Xavier Pucel, Audrey Galametz
We introduce a novel Interval Bound Propagation (IBP) approach for the formal
verification of object detection models, specifically targeting the
Intersection over Union (IoU) metric. The approach has been implemented in an
open source code, named IBP IoU, compatible with popular abstract
interpretation based verification tools. The resulting verifier is evaluated on
landing approach runway detection and handwritten digit recognition case
studies. Comparisons against a baseline (Vanilla IBP IoU) highlight the
superior performance of IBP IoU in ensuring accuracy and stability,
contributing to more secure and robust machine learning applications.
comment: 44th Digital Avionics Systems Conference (DASC), Sep 2025, Montreal,
Canada
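The flavor of bounding IoU under coordinate perturbations can be shown with plain interval arithmetic. This is a vanilla-style, deliberately loose sketch (closer to the paper's baseline than to IBP IoU itself): boxes are (x1, y1, x2, y2) and each coordinate is known only up to an [lo, hi] interval.

```python
def area_bounds(box_lo, box_hi):
    """Lower/upper bounds on box area given per-coordinate intervals."""
    x1l, y1l, x2l, y2l = box_lo
    x1h, y1h, x2h, y2h = box_hi
    w_lo, w_hi = max(0.0, x2l - x1h), max(0.0, x2h - x1l)
    h_lo, h_hi = max(0.0, y2l - y1h), max(0.0, y2h - y1l)
    return w_lo * h_lo, w_hi * h_hi

def intersection_bounds(a_lo, a_hi, b_lo, b_hi):
    # Intersection box: max of left/top edges, min of right/bottom edges,
    # applied coordinate-wise to the interval endpoints.
    lo = [max(a_lo[0], b_lo[0]), max(a_lo[1], b_lo[1]),
          min(a_lo[2], b_lo[2]), min(a_lo[3], b_lo[3])]
    hi = [max(a_hi[0], b_hi[0]), max(a_hi[1], b_hi[1]),
          min(a_hi[2], b_hi[2]), min(a_hi[3], b_hi[3])]
    return area_bounds(lo, hi)

def iou_bounds(a_lo, a_hi, b_lo, b_hi, eps=1e-9):
    """Sound (but loose) interval bounds on IoU = I / (A + B - I)."""
    i_lo, i_hi = intersection_bounds(a_lo, a_hi, b_lo, b_hi)
    area_a_lo, area_a_hi = area_bounds(a_lo, a_hi)
    area_b_lo, area_b_hi = area_bounds(b_lo, b_hi)
    u_hi = area_a_hi + area_b_hi - i_lo            # over-approximate the union
    u_lo = max(area_a_lo + area_b_lo - i_hi, i_hi, eps)
    return i_lo / max(u_hi, eps), min(1.0, i_hi / u_lo)

# Degenerate intervals (lo == hi) must recover the exact IoU.
box = (0.0, 0.0, 2.0, 2.0)
other = (1.0, 1.0, 3.0, 3.0)                       # 1x1 overlap, union 7 -> IoU 1/7
lo, hi = iou_bounds(box, box, other, other)
```

Because IoU's numerator and denominator share the intersection term, naive interval arithmetic over-approximates; tightening exactly this dependency is what makes a dedicated IoU verifier worthwhile.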
♻ ☆ D-HUMOR: Dark Humor Understanding via Multimodal Open-ended Reasoning -- A Benchmark Dataset and Method ICDM
Sai Kartheek Reddy Kasu, Mohammad Zia Ur Rehman, Shahid Shafi Dar, Rishi Bharat Junghare, Dhanvin Sanjay Namboodiri, Nagendra Kumar
Dark humor in online memes poses unique challenges due to its reliance on
implicit, sensitive, and culturally contextual cues. To address the lack of
resources and methods for detecting dark humor in multimodal content, we
introduce a novel dataset of 4,379 Reddit memes annotated for dark humor,
target category (gender, mental health, violence, race, disability, and other),
and a three-level intensity rating (mild, moderate, severe). Building on this
resource, we propose a reasoning-augmented framework that first generates
structured explanations for each meme using a Large Vision-Language Model
(VLM). Through a Role-Reversal Self-Loop, the VLM adopts the author's perspective
to iteratively refine its explanations, ensuring completeness and alignment. We
then extract textual features from both the OCR transcript and the self-refined
reasoning via a text encoder, while visual features are obtained using a vision
transformer. A Tri-stream Cross-Reasoning Network (TCRNet) fuses these three
streams (text, image, and reasoning) via pairwise attention mechanisms,
producing a unified representation for classification. Experimental results
demonstrate that our approach outperforms strong baselines across three tasks:
dark humor detection, target identification, and intensity prediction. The
dataset, annotations, and code are released to facilitate further research in
multimodal humor understanding and content moderation. Code and Dataset are
available at:
https://github.com/Sai-Kartheek-Reddy/D-Humor-Dark-Humor-Understanding-via-Multimodal-Open-ended-Reasoning
comment: Accepted at IEEE International Conference on Data Mining (ICDM) 2025
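Pairwise fusion of the three streams can be sketched with standard cross-attention. This is a hypothetical illustration of the idea; the dimensions, pooling, and classification head are assumptions, not TCRNet's actual configuration.

```python
import torch
import torch.nn as nn

class PairwiseFusion(nn.Module):
    """Fuse text, image, and reasoning tokens via pairwise cross-attention."""
    def __init__(self, dim=64, heads=4, n_classes=3):
        super().__init__()
        self.attn = nn.ModuleDict({
            name: nn.MultiheadAttention(dim, heads, batch_first=True)
            for name in ["ti", "tr", "ir"]  # text-image, text-reasoning, image-reasoning
        })
        self.head = nn.Linear(3 * dim, n_classes)

    def forward(self, text, image, reasoning):
        ti, _ = self.attn["ti"](text, image, image)          # text attends to image
        tr, _ = self.attn["tr"](text, reasoning, reasoning)  # text attends to reasoning
        ir, _ = self.attn["ir"](image, reasoning, reasoning) # image attends to reasoning
        fused = torch.cat([ti.mean(1), tr.mean(1), ir.mean(1)], dim=-1)
        return self.head(fused)

model = PairwiseFusion()
text = torch.rand(2, 12, 64)       # (batch, tokens, dim) from a text encoder
image = torch.rand(2, 49, 64)      # patch tokens from a vision transformer
reasoning = torch.rand(2, 20, 64)  # tokens from the self-refined explanation
logits = model(text, image, reasoning)
```

Each pairwise attention lets one modality query another, so the fused vector carries cross-modal evidence (e.g. which image regions support the self-refined explanation) rather than a plain concatenation of independent features.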